Multi-Level ResNets with Stacked SRUs for Action Recognition

Convolutional Neural Networks (CNNs) have proven effective for action recognition. However, most existing approaches are either computationally inefficient or difficult to optimize. Inspired by the consistent success of LSTMs on sequence-related tasks, we propose a multi-level residual network with stacked Simple Recurrent Units (R-SRU), in which ResNets learn spatial representations from frame appearance and stacked SRUs learn temporal dynamics from video sequences, capturing both spatial and temporal structure. We qualitatively analyze the effect of diverse hyper-parameter settings, aiming to recommend better hyper-parameter choices to researchers using SRUs. Additionally, we compare low-, mid-, and high-level representations of video frames extracted with pretrained ResNets, combine the multi-level representations, and pass them through SRUs followed by various temporal pooling schemes, experimentally demonstrating how features at different levels contribute to action recognition. To the best of our knowledge, we are the first to apply SRUs to action recognition; three independent models are trained end-to-end and then combined. A series of experiments is carried out on two standard benchmarks, HMDB-51 and UCF-101. The results show that R-SRU outperforms the majority of methods taking only RGB data as input and achieves performance competitive with the state of the art: 51.31% on HMDB-51 and 81.38% on UCF-101.
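
For concreteness, the sketch below shows one way the described pipeline could be assembled in PyTorch: a pretrained ResNet encodes each frame, stacked SRU layers (Lei et al., 2018) model temporal dynamics, and mean pooling over time precedes the classifier. The backbone depth, hidden size, number of SRU layers, and pooling choice here are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of an R-SRU-style pipeline, assuming a ResNet-50 backbone,
# two SRU layers, and mean pooling over time (all choices are illustrative).
import torch
import torch.nn as nn
import torchvision.models as models

class SRULayer(nn.Module):
    """One SRU layer (Lei et al., 2018): all matrix multiplications are
    time-independent and batched; only cheap elementwise ops are sequential."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.linear = nn.Linear(input_size, 3 * hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x):                       # x: (T, B, input_size)
        T, B, _ = x.shape
        u, f, r = self.linear(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = x.new_zeros(B, self.hidden_size)
        hs = []
        for t in range(T):
            c = f[t] * c + (1 - f[t]) * u[t]    # light recurrence on the cell state
            hs.append(r[t] * torch.tanh(c))     # highway term omitted for brevity
        return torch.stack(hs)                  # (T, B, hidden_size)

class RSRU(nn.Module):
    def __init__(self, num_classes, hidden=512, num_layers=2):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Keep everything up to (and including) global average pooling:
        # per-frame high-level features of dimension 2048.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        dims = [2048] + [hidden] * num_layers
        self.srus = nn.ModuleList(SRULayer(i, o) for i, o in zip(dims, dims[1:]))
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                    # clip: (B, T, 3, 224, 224)
        B, T = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).flatten(1)   # (B*T, 2048)
        x = feats.view(B, T, -1).transpose(0, 1)               # (T, B, 2048)
        for sru in self.srus:
            x = sru(x)
        return self.fc(x.mean(dim=0))           # mean pooling over time

# Usage: logits = RSRU(num_classes=51)(torch.randn(2, 16, 3, 224, 224))
```

Features from earlier ResNet stages (low- and mid-level) could be tapped analogously by truncating the backbone earlier; the abstract's multi-level comparison amounts to feeding such differently truncated features through the same SRU stack.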