Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2108.09322
Cited By
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
20 August 2021
Jiawei Chen
C. Ho
ViT
Re-assign community
ArXiv
PDF
HTML
Papers citing
"MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition"
50 / 72 papers shown
Title
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
Asmar Nadeem
Faegheh Sardari
R. Dawes
Syed Sameed Husain
Adrian Hilton
Armin Mustafa
74
4
0
10 Jun 2024
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
353
796
0
18 Apr 2021
ViViT: A Video Vision Transformer
Anurag Arnab
Mostafa Dehghani
G. Heigold
Chen Sun
Mario Lucic
Cordelia Schmid
ViT
84
2,119
0
29 Mar 2021
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Ze Liu
Yutong Lin
Yue Cao
Han Hu
Yixuan Wei
Zheng Zhang
Stephen Lin
B. Guo
ViT
203
21,051
0
25 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
510
28,659
0
26 Feb 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
301
2,016
0
09 Feb 2021
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
226
430
0
01 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
91
113
0
31 Jan 2021
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
41
77
0
08 Dec 2020
Deformable DETR: Deformable Transformers for End-to-End Object Detection
Xizhou Zhu
Weijie Su
Lewei Lu
Bin Li
Xiaogang Wang
Jifeng Dai
ViT
129
4,993
0
08 Oct 2020
Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition
Jiawei Chen
Jenson Hsiao
C. Ho
26
5
0
03 Aug 2020
AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification
Xiaofang Wang
Xuehan Xiong
Maxim Neumann
A. Piergiovanni
Michael S. Ryoo
A. Angelova
Kris Kitani
Wei Hua
36
51
0
23 Jul 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
494
600
0
21 Jul 2020
MotionSqueeze: Neural Motion Feature Learning for Video Understanding
Heeseung Kwon
Manjin Kim
Suha Kwak
Minsu Cho
FAtt
46
128
0
20 Jul 2020
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos
Apoorv Vyas
Nikolaos Pappas
Franccois Fleuret
92
1,716
0
29 Jun 2020
Linformer: Self-Attention with Linear Complexity
Sinong Wang
Belinda Z. Li
Madian Khabsa
Han Fang
Hao Ma
139
1,678
0
08 Jun 2020
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT
3DV
PINN
228
12,847
0
26 May 2020
Quantifying Attention Flow in Transformers
Samira Abnar
Willem H. Zuidema
83
786
0
02 May 2020
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
89
1,009
0
09 Apr 2020
TEA: Temporal Excitation and Aggregation for Action Recognition
Yan-Ran Li
Bin Ji
Xintian Shi
Jianguo Zhang
Bin Kang
Limin Wang
ViT
48
441
0
03 Apr 2020
Speech2Action: Cross-modal Supervision for Action Recognition
Arsha Nagrani
Chen Sun
David A. Ross
Rahul Sukthankar
Cordelia Schmid
Andrew Zisserman
38
54
0
30 Mar 2020
Listen to Look: Action Recognition by Previewing Audio
Ruohan Gao
Tae-Hyun Oh
Kristen Grauman
Lorenzo Torresani
VLM
45
251
0
10 Dec 2019
A Multigrid Method for Efficiently Training Video Models
Chaoxia Wu
Ross B. Girshick
Kaiming He
Christoph Feichtenhofer
Philipp Krahenbuhl
49
94
0
02 Dec 2019
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
Quanfu Fan
Chun-Fu Chen
Hilde Kuehne
Marco Pistoia
David D. Cox
32
126
0
02 Dec 2019
Factorized Multimodal Transformer for Multimodal Sequential Learning
Amir Zadeh
Chengfeng Mao
Kelly Shi
Yiwei Zhang
Paul Pu Liang
Soujanya Poria
Louis-Philippe Morency
40
44
0
22 Nov 2019
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh
Lysandre Debut
Julien Chaumond
Thomas Wolf
78
7,386
0
02 Oct 2019
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Evangelos Kazakos
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
38
332
0
22 Aug 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Hao Tan
Joey Tianyi Zhou
VLM
MLLM
183
2,467
0
20 Aug 2019
STM: SpatioTemporal and Motion Encoding for Action Recognition
Boyuan Jiang
Mengmeng Wang
Weihao Gan
Wei Wu
Junjie Yan
40
381
0
07 Aug 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
336
24,160
0
26 Jul 2019
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang
Zihang Dai
Yiming Yang
J. Carbonell
Ruslan Salakhutdinov
Quoc V. Le
AI4CE
154
8,386
0
19 Jun 2019
Learning Spatio-Temporal Representation with Local and Global Diffusion
Zhaofan Qiu
Ting Yao
Chong-Wah Ngo
Xinmei Tian
Tao Mei
20
170
0
13 Jun 2019
Multimodal Transformer for Unaligned Multimodal Language Sequences
Yao-Hung Hubert Tsai
Shaojie Bai
Paul Pu Liang
J. Zico Kolter
Louis-Philippe Morency
Ruslan Salakhutdinov
50
1,280
0
01 Jun 2019
Attention Augmented Convolutional Networks
Irwan Bello
Barret Zoph
Ashish Vaswani
Jonathon Shlens
Quoc V. Le
111
1,008
0
22 Apr 2019
Video Classification with Channel-Separated Convolutional Networks
Du Tran
Heng Wang
Lorenzo Torresani
Matt Feiszli
3DV
35
583
0
04 Apr 2019
DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition
Zheng Shou
Xudong Lin
Yannis Kalantidis
Laura Sevilla-Lara
Marcus Rohrbach
Shih-Fu Chang
Zhicheng Yan
VGen
54
120
0
11 Jan 2019
D3D: Distilled 3D Networks for Video Action Recognition
Jonathan C. Stroud
David A. Ross
Chen Sun
Jia Deng
Rahul Sukthankar
3DPC
40
159
0
19 Dec 2018
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
130
3,244
0
10 Dec 2018
Video Action Transformer Network
Rohit Girdhar
João Carreira
Carl Doersch
Andrew Zisserman
ViT
104
706
0
06 Dec 2018
TSM: Temporal Shift Module for Efficient Video Understanding
Ji Lin
Chuang Gan
Song Han
60
1,677
0
20 Nov 2018
A
2
A^2
A
2
-Nets: Double Attention Networks
Yunpeng Chen
Yannis Kalantidis
Jianshu Li
Shuicheng Yan
Jiashi Feng
46
531
0
27 Oct 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLM
SSL
SSeg
751
93,936
0
11 Oct 2018
Multi-Fiber Networks for Video Recognition
Yunpeng Chen
Yannis Kalantidis
Jianshu Li
Shuicheng Yan
Jiashi Feng
CVBM
82
217
0
30 Jul 2018
End-to-End Learning of Motion Representation for Video Understanding
Lijie Fan
Wen-bing Huang
Chuang Gan
Stefano Ermon
Boqing Gong
Junzhou Huang
39
214
0
02 Apr 2018
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
Sijie Yan
Yuanjun Xiong
Dahua Lin
GNN
179
4,124
0
23 Jan 2018
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Patrick Murphy
3DH
119
1,317
0
13 Dec 2017
Compressed Video Action Recognition
Chao-Yuan Wu
Manzil Zaheer
Hexiang Hu
R. Manmatha
Alex Smola
Philipp Krahenbuhl
116
325
0
02 Dec 2017
Relation Networks for Object Detection
Han Hu
Jiayuan Gu
Zheng Zhang
Jifeng Dai
Yichen Wei
ObjD
66
1,222
0
30 Nov 2017
A Closer Look at Spatiotemporal Convolutions for Action Recognition
Du Tran
Heng Wang
Lorenzo Torresani
Jamie Ray
Yann LeCun
Manohar Paluri
172
3,007
0
30 Nov 2017
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
Zhaofan Qiu
Ting Yao
Tao Mei
52
1,655
0
28 Nov 2017
1
2
Next