
Multi-Level ResNets with Stacked SRUs for Action Recognition

Abstract

Inspired by the enormous breakthroughs that convolutional networks have consistently made in image classification, and noting that most existing action-recognition methods are either inefficient or hard to optimize, we propose an end-to-end multi-level residual network with stacked simple recurrent units (R-SRU): ResNets learn spatial information from frame appearances, while stacked SRUs learn temporal dynamics from video sequences. We investigate the effect of diverse hyper-parameter settings, aiming to recommend better hyper-parameter choices to researchers applying SRUs. Additionally, we compare the low-, mid-, and high-level features produced by ResNets, combine the multi-level features, and pass them through SRUs with various temporal pooling schemes, experimentally demonstrating how much each feature level contributes to action recognition. To the best of our knowledge, we are the first to apply SRUs to action recognition. A series of experiments is carried out on two standard benchmarks: the HMDB-51 and UCF-101 datasets. Experimental results show that R-SRU outperforms the majority of methods that take only RGB data as input and achieves performance competitive with the state of the art: 51.31% on HMDB-51 and 81.38% on UCF-101.
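The pipeline the abstract describes (per-frame ResNet features fed through stacked SRU layers, followed by temporal pooling) can be sketched roughly as below. This is a minimal, hypothetical illustration using the standard SRU recurrence of Lei et al. (2017), not the authors' actual implementation; the feature dimension, layer count, and mean pooling are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_layer(xs, W, Wf, bf, Wr, br):
    """One SRU layer (Lei et al., 2017). The recurrence on the cell
    state c_t is elementwise, which is what makes SRUs fast."""
    T, d = xs.shape
    c = np.zeros(d)
    hs = np.empty_like(xs)
    for t in range(T):
        x = xs[t]
        f = sigmoid(Wf @ x + bf)                 # forget gate
        r = sigmoid(Wr @ x + br)                 # reset gate
        c = f * c + (1.0 - f) * (W @ x)          # elementwise cell update
        hs[t] = r * np.tanh(c) + (1.0 - r) * x   # highway-style output
    return hs

rng = np.random.default_rng(0)
d, T = 8, 5  # toy feature size and clip length (assumed, for illustration)
W, Wf, Wr = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
bf = br = np.zeros(d)

# Hypothetical per-frame ResNet features for one clip: shape (T, d).
feats = rng.standard_normal((T, d))

# Two stacked SRU layers, then mean pooling over time (one of the
# temporal pooling choices the paper compares).
h = sru_layer(sru_layer(feats, W, Wf, bf, Wr, br), W, Wf, bf, Wr, br)
video_repr = h.mean(axis=0)
print(video_repr.shape)  # (8,)
```

In a real model the pooled vector would feed a softmax classifier over action classes, and each layer would have its own weights; here the layers share parameters only to keep the sketch short.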
