Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2102.05095
Cited By
v1
v2
v3
v4 (latest)
Is Space-Time Attention All You Need for Video Understanding?
9 February 2021
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
Re-assign community
ArXiv (abs)
PDF
HTML
Github (1694★)
Papers citing
"Is Space-Time Attention All You Need for Video Understanding?"
50 / 108 papers shown
Title
Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition
Ping Li
Jianan Ni
Bo Pang
AAML
245
0
0
23 May 2025
DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos
Zijia Lu
A S M Iftekhar
Gaurav Mittal
Tianjian Meng
Xiawei Wang
Cheng Zhao
Rohith Kukkala
Ehsan Elhamifar
Mei Chen
59
0
0
22 May 2025
CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework
Wentao Wu
Xinyu Wang
Chenglong Li
Bo Jiang
Jin Tang
Bin Luo
Qi Liu
90
0
0
17 Apr 2025
MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion
Trung Thanh Nguyen
Yasutomo Kawanishi
Vijay John
Takahiro Komamizu
Ichiro Ide
95
0
0
03 Apr 2025
Coca-Splat: Collaborative Optimization for Camera Parameters and 3D Gaussians
Jiamin Wu
Hongyang Li
Xiaoke Jiang
Yuan Yao
Lei Zhang
3DGS
132
0
0
01 Apr 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
232
0
0
30 Mar 2025
Segment Any Motion in Videos
Nan Huang
Wenzhao Zheng
Chenfeng Xu
Kurt Keutzer
Shanghang Zhang
Angjoo Kanazawa
Qianqian Wang
VOS
91
1
0
28 Mar 2025
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
Zichen Liu
Kunlun Xu
Fuchun Sun
Xu Zou
Yuxin Peng
Jiahuan Zhou
VLM
AI4TS
151
2
0
20 Mar 2025
A Large-Scale Study on Video Action Dataset Condensation
Yang Chen
Sheng Guo
Bo Zheng
Limin Wang
DD
140
2
0
13 Mar 2025
STAA-SNN: Spatial-Temporal Attention Aggregator for Spiking Neural Networks
Tianqing Zhang
Kairong Yu
Xian Zhong
Hongwei Wang
Qi Xu
Qiang Zhang
114
1
0
04 Mar 2025
MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval
Haoran Tang
Meng Cao
Jinfa Huang
Ruyang Liu
Peng Jin
Ge Li
Xiaodan Liang
Mamba
157
4
0
24 Feb 2025
X-IL: Exploring the Design Space of Imitation Learning Policies
Xiaogang Jia
Atalay Donat
Xi Huang
Xuan Zhao
Denis Blessing
...
Han A. Wang
Hanyi Zhang
Qian Wang
Rudolf Lioutikov
Gerhard Neumann
139
1
0
20 Feb 2025
Learning semantical dynamics and spatiotemporal collaboration for human pose estimation in video
Runyang Feng
Haoming Chen
3DH
127
0
0
15 Feb 2025
MoFM: A Large-Scale Human Motion Foundation Model
Mohammadreza Baharani
Ghazal Alinezhad Noghre
Armin Danesh Pazho
Gabriel Maldonado
Hamed Tabkhi
AI4CE
426
1
0
08 Feb 2025
Human Activity Recognition in an Open World
D. Prijatelj
Samuel Grieggs
Jin Huang
Dawei Du
Ameya Shringi
Christopher Funk
Adam Kaufman
Eric Robertson
Walter J. Scheirer University of Notre Dame
121
3
0
17 Jan 2025
Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation
Yunzhi Zhuge
Hongyu Gu
Lu Zhang
Jinqing Qi
Huchuan Lu
VOS
167
3
0
14 Jan 2025
H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
Tian Jin
Yuxiao Luo
Yue Ma
Yu Qiao
Yali Wang
Mamba
106
1
0
08 Jan 2025
Measuring Error Alignment for Decision-Making Systems
Binxia Xu
Antonis Bikakis
Daniel Onah
A. Vlachidis
Luke Dickens
81
0
0
03 Jan 2025
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
108
26
0
31 Dec 2024
Hierarchical Vector Quantization for Unsupervised Action Segmentation
Federico Spurio
Emad Bahrami
Gianpiero Francesca
Juergen Gall
90
0
0
23 Dec 2024
VidTwin: Video VAE with Decoupled Structure and Dynamics
Yuchi Wang
Junliang Guo
Xinyi Xie
Tianyu He
Xu Sun
Li Zhao
DRL
VGen
132
4
0
23 Dec 2024
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
Taein Son
Soo Won Seo
Jisong Kim
S. Lee
Jun Won Choi
VGen
111
0
0
18 Dec 2024
Do Language Models Understand Time?
Xi Ding
Lei Wang
261
1
0
18 Dec 2024
Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection
Xiaofeng Tan
Hongsong Wang
Xin Geng
Liang Wang
DiffM
AAML
VGen
142
0
0
04 Dec 2024
TechCoach: Towards Technical-Point-Aware Descriptive Action Coaching
Yuan-Ming Li
An-Lan Wang
Kun-Yu Lin
Yu-Ming Tang
Ling-an Zeng
Jian-Fang Hu
Wei-Shi Zheng
154
6
0
26 Nov 2024
LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces
Anxhelo Diko
Antonino Furnari
Luigi Cinque
G. Farinella
310
0
0
23 Nov 2024
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang
Zhuokai Zhao
Zhaorun Chen
Zenghui Ding
Xianjun Yang
Yining Sun
493
1
0
21 Nov 2024
LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement
Siwen Jiao
Yangyi Fang
Baoyun Peng
Wangqun Chen
Bharadwaj Veeravalli
143
4
0
20 Nov 2024
Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao
Gen Li
Shreyank N. Gowda
Robert B Fisher
Jonathan Huang
Anurag Arnab
Laura Sevilla-Lara
139
0
0
20 Nov 2024
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
Vipula Rawte
Sarthak Jain
Aarush Sinha
Garv Kaushik
Aman Bansal
...
Aishwarya N. Reganti
Vinija Jain
Aman Chadha
A. Sheth
A. Das
VLM
MLLM
200
1
0
16 Nov 2024
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction
Yang Zhou
Hao Shao
Letian Wang
Steven Waslander
Hongsheng Li
Yu Liu
70
2
0
11 Oct 2024
Restructuring Vector Quantization with the Rotation Trick
Christopher Fifty
Ronald G. Junkins
Dennis Duan
Aniketh Iger
Jerry W. Liu
Ehsan Amid
Sebastian Thrun
Christopher Ré
LLMSV
129
13
0
08 Oct 2024
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
Kun Yuan
V. Srivastav
Nassir Navab
N. Padoy
110
9
0
30 Sep 2024
Spacewalker: Traversing Representation Spaces for Fast Interactive Exploration and Annotation of Unstructured Data
Lukas Heine
Fabian Horst
Jana Fragemann
Gijs Luijten
M. Balzer
Jan Egger
F. Bahnsen
M. Sarfraz
Jens Kleesiek
72
0
0
25 Sep 2024
Ctrl-GenAug: Controllable Generative Augmentation for Medical Sequence Classification
Xinrui Zhou
Yuhao Huang
Haoran Dou
Shijing Chen
Ao Chang
...
Jie Jessie Ren
Ruobing Huang
Jun Cheng
Wufeng Xue
Dong Ni
MedIm
335
0
0
25 Sep 2024
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
Tz-Ying Wu
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
EgoV
111
0
0
28 Jul 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
125
67
0
29 May 2024
Rethinking Efficient and Effective Point-based Networks for Event Camera Classification and Regression: EventMamba
Hongwei Ren
Yue Zhou
Jiadong Zhu
Haotian Fu
Yulong Huang
Xiaopeng Lin
Yuetong Fang
Fei Ma
Hao Yu
Bo-Xun Cheng
Mamba
90
10
0
09 May 2024
ShadowMaskFormer: Mask Augmented Patch Embeddings for Shadow Removal
Zhuohao Li
Guoyang Xie
Guannan Jiang
Zhichao Lu
76
3
0
29 Apr 2024
Reasoning-Enhanced Object-Centric Learning for Videos
Jian Li
Pu Ren
Yang Liu
Hao Sun
OCL
LRM
106
2
0
22 Mar 2024
Collaboratively Self-supervised Video Representation Learning for Action Recognition
Jie Zhang
Zhifan Wan
Lanqing Hu
Stephen Lin
Shuzhe Wu
Shiguang Shan
TTA
130
1
0
15 Jan 2024
Latte: Latent Diffusion Transformer for Video Generation
Xin Ma
Yaohui Wang
Gengyun Jia
Xinyuan Chen
Ziqiang Liu
Yuan-Fang Li
Cunjian Chen
Yu Qiao
DiffM
VGen
238
270
0
05 Jan 2024
Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition
Tianlin Li
Yao Rong
Shiao Wang
Yuan Chen
Zhe Wu
Bowei Jiang
Yonghong Tian
Jin Tang
ViT
106
3
0
18 Dec 2023
Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
Arun V. Reddy
William Paul
Corban Rivera
Ketul Shah
Celso M. de Melo
Rama Chellappa
83
4
0
05 Dec 2023
Transformer-Based Model for Monocular Visual Odometry: A Video Understanding Approach
André O. Françani
Marcos R. O. A. Máximo
72
8
0
10 May 2023
Boosting Convolution with Efficient MLP-Permutation for Volumetric Medical Image Segmentation
Yi Lin
Xiao Fang
Dong Zhang
Kwang-Ting Cheng
Hao Chen
MedIm
170
4
0
23 Mar 2023
Transfer-learning for video classification: Video Swin Transformer on multiple domains
Daniel de Oliveira
David Martins de Matos
ViT
62
0
0
18 Oct 2022
TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up
Yi Ding
Shiyu Chang
Zhangyang Wang
ViT
113
392
0
14 Feb 2021
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
387
6,768
0
23 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
657
41,103
0
22 Oct 2020
1
2
3
Next