arXiv: 2203.12602
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
23 March 2022
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
Papers citing
"VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training"
50 / 719 papers shown
Title
Survey on Foundation Models for Prognostics and Health Management in Industrial Cyber-Physical Systems
Ruonan Liu
Quanhu Zhang
Te Han
AI4CE
49
2
0
11 Dec 2023
Counterfactual World Modeling for Physical Dynamics Understanding
Rahul Venkatesh
Honglin Chen
Kevin T. Feigelis
Daniel M. Bear
Khaled Jedoui
...
Wanhee Lee
Sherry Liu
Kevin A. Smith
Judith E. Fan
Daniel L. K. Yamins
VGen
45
1
0
11 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
27
38
0
11 Dec 2023
From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos
Yin Chen
Jia Li
Shiguang Shan
Meng Wang
Richang Hong
59
32
0
09 Dec 2023
LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos
Ying Wang
Yanlai Yang
Mengye Ren
49
15
0
07 Dec 2023
A brief introduction to a framework named Multilevel Guidance-Exploration Network
Guoqing Yang
Zhiming Luo
Jianzhe Gao
Yingxin Lai
Kun Yang
Yifan He
Shaozi Li
3DH
34
0
0
07 Dec 2023
Deep Multimodal Fusion for Surgical Feedback Classification
Rafal Kocielnik
Elyssa Y. Wong
Timothy N. Chu
Lydia Lin
De-An Huang
Jiayun Wang
A. Anandkumar
Andrew J. Hung
35
2
0
06 Dec 2023
Multitask Learning Can Improve Worst-Group Outcomes
Atharva Kulkarni
Lucio Dery
Amrith Rajagopal Setlur
Aditi Raghunathan
Ameet Talwalkar
Graham Neubig
43
1
0
05 Dec 2023
Are Vision Transformers More Data Hungry Than Newborn Visual Systems?
Lalit Pandey
Samantha M. W. Wood
Justin N. Wood
46
12
0
05 Dec 2023
Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
Arun V. Reddy
William Paul
Corban Rivera
Ketul Shah
Celso M. de Melo
Rama Chellappa
42
4
0
05 Dec 2023
Bootstrapping SparseFormers from Vision Foundation Models
Ziteng Gao
Zhan Tong
K. Lin
Joya Chen
Mike Zheng Shou
41
0
0
04 Dec 2023
Adapting Short-Term Transformers for Action Detection in Untrimmed Videos
Min Yang
Huan Gao
Ping Guo
Limin Wang
ViT
36
5
0
04 Dec 2023
SANeRF-HQ: Segment Anything for NeRF in High Quality
Yichen Liu
Benran Hu
Chi-Keung Tang
Yu-Wing Tai
41
11
0
03 Dec 2023
Learning from One Continuous Video Stream
João Carreira
Michael King
Viorica Patraucean
Dilara Gokay
Catalin Ionescu
...
Joseph Heyward
Carl Doersch
Y. Aytar
Dima Damen
Andrew Zisserman
CLL
37
4
0
01 Dec 2023
Spatial-Temporal-Decoupled Masked Pre-training for Spatiotemporal Forecasting
Haotian Gao
Renhe Jiang
Zheng Dong
Jinliang Deng
Yuxin Ma
Xuan Song
AI4TS
46
15
0
01 Dec 2023
Dolphins: Multimodal Language Model for Driving
Yingzi Ma
Yulong Cao
Jiachen Sun
Marco Pavone
Chaowei Xiao
MLLM
43
51
0
01 Dec 2023
Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement
Ziyu Wang
Yue Xu
Cewu Lu
Yong-Lu Li
DD
46
8
0
01 Dec 2023
CAST: Cross-Attention in Space and Time for Video Action Recognition
Dongho Lee
Jongseo Lee
Jinwoo Choi
EgoV
35
12
0
30 Nov 2023
DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding
Kyungho Bae
Geo Ahn
Youngrae Kim
Jinwoo Choi
30
3
0
30 Nov 2023
Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes
Chi-Hsi Kung
Shu-Wei Lu
Yi-Hsuan Tsai
Yi-Ting Chen
37
6
0
29 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
40
0
0
28 Nov 2023
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
Shuming Liu
Chen-Da Liu-Zhang
Chen Zhao
Guohao Li
38
25
0
28 Nov 2023
SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models
Yuwei Guo
Ceyuan Yang
Anyi Rao
Maneesh Agrawala
Dahua Lin
Bo Dai
DiffM
VGen
28
115
0
28 Nov 2023
Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition
Jiaming Zhou
Hanjun Li
Kun-Yu Lin
Junwei Liang
29
1
0
28 Nov 2023
A-JEPA: Joint-Embedding Predictive Architecture Can Listen
Zhengcong Fei
Mingyuan Fan
Junshi Huang
30
17
0
27 Nov 2023
Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
Yifei Chen
Dapeng Chen
Ruijin Liu
Sai Zhou
Wenyuan Xue
Wei Peng
33
6
0
27 Nov 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
37
3
0
25 Nov 2023
VLM-Eval: A General Evaluation on Video Large Language Models
Shuailin Li
Yuang Zhang
Yucheng Zhao
Qiuyue Wang
Fan Jia
Yingfei Liu
Tiancai Wang
MLLM
ELM
44
2
0
20 Nov 2023
Pair-wise Layer Attention with Spatial Masking for Video Prediction
Ping Li
Chenhan Zhang
Zheng Yang
Xianghua Xu
Mingli Song
29
0
0
19 Nov 2023
Multi-entity Video Transformers for Fine-Grained Video Representation Learning
Matthew Walmer
Rose Kanjirathinkal
Kai Sheng Tai
Keyur Muzumdar
Taipeng Tian
Abhinav Shrivastava
ViT
37
0
0
17 Nov 2023
Language Semantic Graph Guided Data-Efficient Learning
Wenxuan Ma
Shuang Li
Lincan Cai
Jingxuan Kang
45
4
0
15 Nov 2023
SpectralGPT: Spectral Remote Sensing Foundation Model
Danfeng Hong
Bing Zhang
Xuyang Li
Yuxuan Li
Chenyu Li
...
Xiuping Jia
Antonio J. Plaza
Paolo Gamba
J. Benediktsson
J. Chanussot
43
393
0
13 Nov 2023
Learning Human Action Recognition Representations Without Real Humans
Howard Zhong
Samarth Mishra
Donghyun Kim
SouYoung Jin
Yikang Shen
Hildegard Kuehne
Leonid Karlinsky
Venkatesh Saligrama
Aude Oliva
Rogerio Feris
29
3
0
10 Nov 2023
Semantic-aware Video Representation for Few-shot Action Recognition
Yutao Tang
Benjamin Bejar
René Vidal
44
7
0
10 Nov 2023
Window Attention is Bugged: How not to Interpolate Position Embeddings
Daniel Bolya
Chaitanya K. Ryali
Judy Hoffman
Christoph Feichtenhofer
48
10
0
09 Nov 2023
Asymmetric Masked Distillation for Pre-Training Small Foundation Models
Zhiyu Zhao
Bingkun Huang
Sen Xing
Gangshan Wu
Yu Qiao
Limin Wang
42
5
0
06 Nov 2023
Holistic Representation Learning for Multitask Trajectory Anomaly Detection
Alexandros Stergiou
B. D. Weerdt
Nikos Deligiannis
56
13
0
03 Nov 2023
FLAP: Fast Language-Audio Pre-training
Ching-Feng Yeh
Po-Yao Huang
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
CLIP
VLM
44
8
0
02 Nov 2023
Concatenated Masked Autoencoders as Spatial-Temporal Learner
Zhouqiang Jiang
Bowen Wang
Tong Xiang
Zhaofeng Niu
Hong Tang
Guangshun Li
Liangzhi Li
33
2
0
02 Nov 2023
Limited Data, Unlimited Potential: A Study on ViTs Augmented by Masked Autoencoders
Srijan Das
Tanmay Jain
Dominick Reilly
P. Balaji
Soumyajit Karmakar
Shyam Marjit
Xiang Li
Abhijit Das
Michael S. Ryoo
41
16
0
31 Oct 2023
HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception
Junkun Yuan
Xinyu Zhang
Hao Zhou
Jian Wang
Zhongwei Qiu
...
Junyu Han
Errui Ding
Lanfen Lin
Fei Wu
Jingdong Wang
38
18
0
31 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
54
2
0
30 Oct 2023
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
Srikumar Sastry
Subash Khanal
Aayush Dhakal
Di Huang
Nathan Jacobs
44
9
0
29 Oct 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
51
28
0
29 Oct 2023
Foundation Models for Generalist Geospatial Artificial Intelligence
Johannes Jakubik
Sujit Roy
C. Phillips
P. Fraccaro
Denys Godwin
...
Hamed Alemohammad
M. Maskey
R. Ganti
Kommy Weldemariam
Rahul Ramachandran
AI4CE
VLM
26
94
0
28 Oct 2023
Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning
Fengyuan Shi
Limin Wang
ViT
38
0
0
26 Oct 2023
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Ziqi Pang
Ziyang Xie
Yunze Man
Yu-xiong Wang
53
25
0
19 Oct 2023
Runner re-identification from single-view running video in the open-world setting
Tomohiro Suzuki
Kazushi Tsutsui
K. Takeda
Keisuke Fujii
31
1
0
18 Oct 2023
An Unbiased Look at Datasets for Visuo-Motor Pre-Training
Sudeep Dasari
Mohan Kumar Srirama
Unnat Jain
Abhinav Gupta
SSL
34
37
0
13 Oct 2023
UniPAD: A Universal Pre-training Paradigm for Autonomous Driving
Honghui Yang
Sha Zhang
Di Huang
Xiaoyang Wu
Haoyi Zhu
...
Hengshuang Zhao
Qibo Qiu
Binbin Lin
Xiaofei He
Wanli Ouyang
SSL
39
45
0
12 Oct 2023