MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

20 August 2021
Jiawei Chen
C. Ho
    ViT

Papers citing "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition"

22 / 72 papers shown
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Kensho Hara
Hirokatsu Kataoka
Y. Satoh
3DPC
89
1,926
0
27 Nov 2017
Appearance-and-Relation Networks for Video Classification
Limin Wang
Wei Li
Wen Li
Luc Van Gool
52
351
0
24 Nov 2017
Temporal Relational Reasoning in Videos
Bolei Zhou
A. Andonian
Aude Oliva
Antonio Torralba
NAI
64
1,035
0
22 Nov 2017
Non-local Neural Networks
Xiaolong Wang
Ross B. Girshick
Abhinav Gupta
Kaiming He
OffRL
170
8,867
0
21 Nov 2017
PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume
Deqing Sun
Xiaodong Yang
Ming-Yu Liu
Jan Kautz
3DPC
222
2,435
0
07 Sep 2017
ConvNet Architecture Search for Spatiotemporal Feature Learning
Du Tran
Jamie Ray
Zheng Shou
Shih-Fu Chang
Manohar Paluri
3DPC
49
383
0
16 Aug 2017
Tensor Fusion Network for Multimodal Sentiment Analysis
Amir Zadeh
Minghai Chen
Soujanya Poria
Erik Cambria
Louis-Philippe Morency
41
1,221
0
23 Jul 2017
Skeleton-based Action Recognition Using LSTM and CNN
Chuankun Li
Pichao Wang
Shuang Wang
Yonghong Hou
W. Li
HAI
56
174
0
06 Jul 2017
The "something something" video database for learning and evaluating visual common sense
Raghav Goyal
Samira Ebrahimi Kahou
Vincent Michalski
Joanna Materzynska
S. Westphal
...
Moritz Mueller-Freitag
F. Hoppe
Christian Thurau
Ingo Bax
Roland Memisevic
VLM
59
1,507
0
13 Jun 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
278
129,831
0
12 Jun 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
João Carreira
Andrew Zisserman
178
7,961
0
22 May 2017
The Kinetics Human Action Video Dataset
W. Kay
João Carreira
Karen Simonyan
Brian Zhang
Chloe Hillier
...
Tim Green
T. Back
Apostol Natsev
Mustafa Suleyman
Andrew Zisserman
182
3,771
0
19 May 2017
ActionFlowNet: Learning Motion Representation for Action Recognition
Joe Yue-Hei Ng
Jonghyun Choi
J. Neumann
L. Davis
45
119
0
09 Dec 2016
Cross-Modal Scene Networks
Y. Aytar
Lluis Castrejon
Carl Vondrick
Hamed Pirsiavash
Antonio Torralba
SSL
27
114
0
27 Oct 2016
Semi-Coupled Two-Stream Fusion ConvNets for Action Recognition at Extremely Low Resolutions
Jiawei Chen
Jonathan Wu
Janusz Konrad
Prakash Ishwar
33
45
0
12 Oct 2016
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Limin Wang
Yuanjun Xiong
Zhe Wang
Yu Qiao
Dahua Lin
Xiaoou Tang
Luc Van Gool
ViT
78
3,814
0
02 Aug 2016
Layer Normalization
Jimmy Lei Ba
J. Kiros
Geoffrey E. Hinton
187
10,412
0
21 Jul 2016
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Akira Fukui
Dong Huk Park
Daylen Yang
Anna Rohrbach
Trevor Darrell
Marcus Rohrbach
253
1,466
0
06 Jun 2016
Real-time Action Recognition with Enhanced Motion Vector CNNs
Bowen Zhang
Limin Wang
Zhe Wang
Yu Qiao
Hanli Wang
60
417
0
26 Apr 2016
Convolutional Two-Stream Network Fusion for Video Action Recognition
Christoph Feichtenhofer
A. Pinz
Andrew Zisserman
103
2,606
0
22 Apr 2016
Two-Stream Convolutional Networks for Action Recognition in Videos
Karen Simonyan
Andrew Zisserman
207
7,518
0
09 Jun 2014
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
K. Soomro
Amir Zamir
M. Shah
CLIP
VGen
65
6,100
0
03 Dec 2012