ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2102.05095
  4. Cited By
Is Space-Time Attention All You Need for Video Understanding?
v1v2v3v4 (latest)

Is Space-Time Attention All You Need for Video Understanding?

9 February 2021
Gedas Bertasius
Heng Wang
Lorenzo Torresani
    ViT
ArXiv (abs)PDFHTMLGithub (1694★)

Papers citing "Is Space-Time Attention All You Need for Video Understanding?"

50 / 108 papers shown
Title
AttentionNAS: Spatiotemporal Attention Cell Search for Video
  Classification
AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification
Xiaofang Wang
Xuehan Xiong
Maxim Neumann
A. Piergiovanni
Michael S. Ryoo
A. Angelova
Kris Kitani
Wei Hua
68
51
0
23 Jul 2020
MotionSqueeze: Neural Motion Feature Learning for Video Understanding
MotionSqueeze: Neural Motion Feature Learning for Video Understanding
Heeseung Kwon
Manjin Kim
Suha Kwak
Minsu Cho
FAtt
77
128
0
20 Jul 2020
End-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
Nicolas Carion
Francisco Massa
Gabriel Synnaeve
Nicolas Usunier
Alexander Kirillov
Sergey Zagoruyko
ViT3DVPINN
415
13,048
0
26 May 2020
Quantifying Attention Flow in Transformers
Quantifying Attention Flow in Transformers
Samira Abnar
Willem H. Zuidema
157
796
0
02 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation
  Pre-training
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLMVLMOffRLAI4TS
118
503
0
01 May 2020
Exploring Self-attention for Image Recognition
Exploring Self-attention for Image Recognition
Hengshuang Zhao
Jiaya Jia
V. Koltun
SSL
95
786
0
28 Apr 2020
X3D: Expanding Architectures for Efficient Video Recognition
X3D: Expanding Architectures for Efficient Video Recognition
Christoph Feichtenhofer
134
1,020
0
09 Apr 2020
TEA: Temporal Excitation and Aggregation for Action Recognition
TEA: Temporal Excitation and Aggregation for Action Recognition
Yan-Ran Li
Bin Ji
Xintian Shi
Jianguo Zhang
Bin Kang
Limin Wang
ViT
87
447
0
03 Apr 2020
Actor-Transformers for Group Activity Recognition
Actor-Transformers for Group Activity Recognition
Kirill Gavrilyuk
Ryan Sanford
Mehrsan Javan
Cees G. M. Snoek
ViT
59
180
0
28 Mar 2020
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Zachary Teed
Jia Deng
MDE
241
2,625
0
26 Mar 2020
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
Huiyu Wang
Yukun Zhu
Bradley Green
Hartwig Adam
Alan Yuille
Liang-Chieh Chen
3DPC
119
672
0
17 Mar 2020
Axial Attention in Multidimensional Transformers
Axial Attention in Multidimensional Transformers
Jonathan Ho
Nal Kalchbrenner
Dirk Weissenborn
Tim Salimans
110
532
0
20 Dec 2019
Classifying, Segmenting, and Tracking Object Instances in Video with
  Mask Propagation
Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation
Gedas Bertasius
Lorenzo Torresani
56
179
0
10 Dec 2019
More Is Less: Learning Efficient Video Representations by Big-Little
  Network and Depthwise Temporal Aggregation
More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation
Quanfu Fan
Chun-Fu Chen
Hilde Kuehne
Marco Pistoia
David D. Cox
74
126
0
02 Dec 2019
On the Relationship between Self-Attention and Convolutional Layers
On the Relationship between Self-Attention and Convolutional Layers
Jean-Baptiste Cordonnier
Andreas Loukas
Martin Jaggi
112
534
0
08 Nov 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text
  Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
439
20,181
0
23 Oct 2019
STM: SpatioTemporal and Motion Encoding for Action Recognition
STM: SpatioTemporal and Motion Encoding for Action Recognition
Boyuan Jiang
Mengmeng Wang
Weihao Gan
Wei Wu
Junjie Yan
79
382
0
07 Aug 2019
Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
Only Time Can Tell: Discovering Temporal Data for Temporal Modeling
Laura Sevilla-Lara
Shengxin Cindy Zha
Zhicheng Yan
Vedanuj Goswami
Matt Feiszli
Lorenzo Torresani
87
75
0
19 Jul 2019
Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue
  Systems
Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems
Hung Le
Doyen Sahoo
Nancy F. Chen
Guosheng Lin
63
111
0
02 Jul 2019
Stand-Alone Self-Attention in Vision Models
Stand-Alone Self-Attention in Vision Models
Prajit Ramachandran
Niki Parmar
Ashish Vaswani
Irwan Bello
Anselm Levskaya
Jonathon Shlens
VLMSLRViT
95
1,214
0
13 Jun 2019
Learning Spatio-Temporal Representation with Local and Global Diffusion
Learning Spatio-Temporal Representation with Local and Global Diffusion
Zhaofan Qiu
Ting Yao
Chong-Wah Ngo
Xinmei Tian
Tao Mei
56
171
0
13 Jun 2019
Video Modeling with Correlation Networks
Video Modeling with Correlation Networks
Heng Wang
Du Tran
Lorenzo Torresani
Matt Feiszli
67
129
0
07 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million
  Narrated Video Clips
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Antoine Miech
Dimitri Zhukov
Jean-Baptiste Alayrac
Makarand Tapaswi
Ivan Laptev
Josef Sivic
VGen
110
1,200
0
07 Jun 2019
Scaling Autoregressive Video Models
Scaling Autoregressive Video Models
Dirk Weissenborn
Oscar Täckström
Jakob Uszkoreit
DiffMVGen
94
201
0
06 Jun 2019
Generating Long Sequences with Sparse Transformers
Generating Long Sequences with Sparse Transformers
R. Child
Scott Gray
Alec Radford
Ilya Sutskever
125
1,899
0
23 Apr 2019
Attention Augmented Convolutional Networks
Attention Augmented Convolutional Networks
Irwan Bello
Barret Zoph
Ashish Vaswani
Jonathon Shlens
Quoc V. Le
132
1,014
0
22 Apr 2019
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural
  Networks with Octave Convolution
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
Yunpeng Chen
Haoqi Fan
Bing Xu
Zhicheng Yan
Yannis Kalantidis
Marcus Rohrbach
Shuicheng Yan
Jiashi Feng
103
560
0
10 Apr 2019
Video Classification with Channel-Separated Convolutional Networks
Video Classification with Channel-Separated Convolutional Networks
Du Tran
Heng Wang
Lorenzo Torresani
Matt Feiszli
3DV
72
587
0
04 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLMSSL
79
1,246
0
03 Apr 2019
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai
Zhilin Yang
Yiming Yang
J. Carbonell
Quoc V. Le
Ruslan Salakhutdinov
VLM
241
3,730
0
09 Jan 2019
D3D: Distilled 3D Networks for Video Action Recognition
D3D: Distilled 3D Networks for Video Action Recognition
Jonathan C. Stroud
David A. Ross
Chen Sun
Jia Deng
Rahul Sukthankar
3DPC
54
160
0
19 Dec 2018
Long-Term Feature Banks for Detailed Video Understanding
Long-Term Feature Banks for Detailed Video Understanding
Chao-Yuan Wu
Christoph Feichtenhofer
Haoqi Fan
Kaiming He
Philipp Krahenbuhl
Ross B. Girshick
174
480
0
12 Dec 2018
SlowFast Networks for Video Recognition
SlowFast Networks for Video Recognition
Christoph Feichtenhofer
Haoqi Fan
Jitendra Malik
Kaiming He
166
3,274
0
10 Dec 2018
Video Action Transformer Network
Video Action Transformer Network
Rohit Girdhar
João Carreira
Carl Doersch
Andrew Zisserman
ViT
126
709
0
06 Dec 2018
Graph-Based Global Reasoning Networks
Graph-Based Global Reasoning Networks
Yunpeng Chen
Marcus Rohrbach
Zhicheng Yan
Shuicheng Yan
Jiashi Feng
Yannis Kalantidis
GNNNAI
302
459
0
30 Nov 2018
CCNet: Criss-Cross Attention for Semantic Segmentation
CCNet: Criss-Cross Attention for Semantic Segmentation
Zilong Huang
Xinggang Wang
Yunchao Wei
Lichao Huang
Humphrey Shi
Wenyu Liu
Chang Huang
VOS
212
2,546
0
28 Nov 2018
TSM: Temporal Shift Module for Efficient Video Understanding
TSM: Temporal Shift Module for Efficient Video Understanding
Ji Lin
Chuang Gan
Song Han
98
1,691
0
20 Nov 2018
$A^2$-Nets: Double Attention Networks
A2A^2A2-Nets: Double Attention Networks
Yunpeng Chen
Yannis Kalantidis
Jianshu Li
Shuicheng Yan
Jiashi Feng
77
532
0
27 Oct 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLMSSLSSeg
1.8K
94,891
0
11 Oct 2018
A Short Note about Kinetics-600
A Short Note about Kinetics-600
João Carreira
Eric Noland
Andras Banki-Horvath
Chloe Hillier
Andrew Zisserman
84
527
0
03 Aug 2018
Multi-Fiber Networks for Video Recognition
Multi-Fiber Networks for Video Recognition
Yunpeng Chen
Yannis Kalantidis
Jianshu Li
Shuicheng Yan
Jiashi Feng
CVBM
105
218
0
30 Jul 2018
Scaling Neural Machine Translation
Scaling Neural Machine Translation
Myle Ott
Sergey Edunov
David Grangier
Michael Auli
AIMat
172
615
0
01 Jun 2018
The Best of Both Worlds: Combining Recent Advances in Neural Machine
  Translation
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation
Mengzhao Chen
Orhan Firat
Ankur Bapna
Melvin Johnson
Wolfgang Macherey
...
Niki Parmar
M. Schuster
Zhifeng Chen
Yonghui Wu
Macduff Hughes
AIMat
58
457
0
26 Apr 2018
End-to-End Dense Video Captioning with Masked Transformer
End-to-End Dense Video Captioning with Masked Transformer
Luowei Zhou
Yingbo Zhou
Jason J. Corso
R. Socher
Caiming Xiong
92
528
0
03 Apr 2018
Image Transformer
Image Transformer
Niki Parmar
Ashish Vaswani
Jakob Uszkoreit
Lukasz Kaiser
Noam M. Shazeer
Alexander Ku
Dustin Tran
ViT
138
1,680
0
15 Feb 2018
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in
  Video Classification
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
Saining Xie
Chen Sun
Jonathan Huang
Zhuowen Tu
Kevin Patrick Murphy
3DH
142
1,330
0
13 Dec 2017
Relation Networks for Object Detection
Relation Networks for Object Detection
Han Hu
Jiayuan Gu
Zheng Zhang
Jifeng Dai
Yichen Wei
ObjD
119
1,223
0
30 Nov 2017
A Closer Look at Spatiotemporal Convolutions for Action Recognition
A Closer Look at Spatiotemporal Convolutions for Action Recognition
Du Tran
Heng Wang
Lorenzo Torresani
Jamie Ray
Yann LeCun
Manohar Paluri
218
3,030
0
30 Nov 2017
Non-local Neural Networks
Non-local Neural Networks
Xinyu Wang
Ross B. Girshick
Abhinav Gupta
Kaiming He
OffRL
289
8,906
0
21 Nov 2017
Revisiting the Effectiveness of Off-the-shelf Temporal Modeling
  Approaches for Large-scale Video Classification
Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification
Yunlong Bian
Chuang Gan
Xiao-Chang Liu
Fu Li
Xiang Long
Yandong Li
Heng Qi
Jie Zhou
Shilei Wen
Yuanqing Lin
44
48
0
12 Aug 2017
Previous
123
Next