arXiv: 2203.12602
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
23 March 2022
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
Papers citing "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training"
50 / 714 papers shown
Title
PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling
Junmyeong Lee
Eui Jun Hwang
Sukmin Cho
Jong C. Park
32
0
0
06 Jan 2025
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Rui Liu
Hongyu Yuan
H. Li
40
0
0
03 Jan 2025
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzadeh
VLM
60
24
0
31 Dec 2024
TravelAgent: Generative Agents in the Built Environment
Ariel Noyman
Kai Hu
Kent Larson
AI4CE
34
2
0
25 Dec 2024
The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning
Shentong Mo
37
0
0
23 Dec 2024
Sensitive Image Classification by Vision Transformers
Hanxian He
Campbell Wilson
Thanh Thi Nguyen
Janis Dalins
ViT
76
0
0
21 Dec 2024
Scaling 4D Representations
João Carreira
Dilara Gokay
Michael King
Chuhan Zhang
Ignacio Rocco
...
Viorica Patraucean
Dima Damen
Pauline Luc
Mehdi S. M. Sajjadi
Andrew Zisserman
80
3
0
19 Dec 2024
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
Taein Son
Soo Won Seo
Jisong Kim
S. Lee
Jun Won Choi
VGen
74
0
0
18 Dec 2024
DINO-Foresight: Looking into the Future with DINO
Efstathios Karypidis
Ioannis Kakogeorgiou
Spyros Gidaris
N. Komodakis
AI4CE
82
1
0
16 Dec 2024
Video Representation Learning with Joint-Embedding Predictive Architectures
Katrina Drozdov
Ravid Shwartz-Ziv
Yann LeCun
AI4TS
74
1
0
14 Dec 2024
Annotation Techniques for Judo Combat Phase Classification from Tournament Footage
Anthony Miyaguchi
Jed Moutahir
Tanmay Sutar
75
0
0
10 Dec 2024
Self-Supervised Learning with Probabilistic Density Labeling for Rainfall Probability Estimation
Junha Lee
Sojung An
Sujeong You
Namik Cho
67
0
0
08 Dec 2024
Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications
Daniela Szwarcman
Sujit Roy
P. Fraccaro
Þorsteinn Elí Gíslason
Benedikt Blumenstiel
...
Rahul Ramachandran
Juan Bernabé-Moreno
Manil Maskey
VLM
75
13
0
03 Dec 2024
MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models
Xiaomin Li
Xu Jia
Qinghe Wang
Haiwen Diao
Mengmeng Ge
Pengxiang Li
You He
Huchuan Lu
VGen
DiffM
60
3
0
02 Dec 2024
Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark
Joseph Heyward
João Carreira
Dima Damen
Andrew Zisserman
Viorica Patraucean
80
2
0
29 Nov 2024
TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition
Yilong Wang
Zilin Gao
Qilong Wang
Zhaofeng Chen
P. Li
Q. Hu
80
1
0
28 Nov 2024
SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting
Gyeongjin Kang
Jisang Yoo
Jihyeon Park
Seungtae Nam
Hyeonsoo Im
Sangheon Shin
Sangpil Kim
Eunbyung Park
3DGS
153
3
0
26 Nov 2024
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
106
1
0
25 Nov 2024
OccludeNet: A Causal Journey into Mixed-View Actor-Centric Video Action Recognition under Occlusions
Guanyu Zhou
Wenxuan Liu
Wenxin Huang
Xuemei Jia
X. Zhong
Chia-Wen Lin
CML
76
0
0
24 Nov 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luis Vilaca
Yi Yu
Paula Viana
75
0
0
24 Nov 2024
When Spatial meets Temporal in Action Recognition
H. Chen
Lei Wang
Y. Chen
Tom Gedeon
Piotr Koniusz
97
2
0
22 Nov 2024
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning
Jiange Yang
Haoyi Zhu
Y. Wang
Gangshan Wu
Tong He
Limin Wang
95
2
0
21 Nov 2024
Extending Video Masked Autoencoders to 128 frames
N. B. Gundavarapu
Luke Friedman
Raghav Goyal
Chaitra Hegde
Eirikur Agustsson
...
Mikhail Sirotenko
Ming Yang
Tobias Weyand
Boqing Gong
Leonid Sigal
75
1
0
20 Nov 2024
Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark
Bing Cao
Quanhao Lu
Jiekang Feng
Pengfei Zhu
Q. Hu
Qilong Wang
73
0
0
20 Nov 2024
Principles of Visual Tokens for Efficient Video Understanding
Xinyue Hao
Gen Li
Shreyank N. Gowda
Robert B Fisher
Jonathan Huang
Anurag Arnab
Laura Sevilla-Lara
96
0
0
20 Nov 2024
KDC-MAE: Knowledge Distilled Contrastive Mask Auto-Encoder
Maheswar Bora
Saurabh Atreya
Aritra Mukherjee
Abhijit Das
87
0
0
19 Nov 2024
Efficient Transfer Learning for Video-language Foundation Models
Haoxing Chen
Zizheng Huang
Y. Hong
Yanshuo Wang
Zhongcai Lyu
Zhuoer Xu
Jun Lan
Zhangxuan Gu
VLM
49
0
0
18 Nov 2024
From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling
Jinhong Lin
Cheng-En Wu
Huanran Li
Jifan Zhang
Yu Hen Hu
Pedro Morgado
36
0
0
16 Nov 2024
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
Vipula Rawte
Sarthak Jain
Aarush Sinha
Garv Kaushik
Aman Bansal
...
Aishwarya N. Reganti
Vinija Jain
Aman Chadha
A. Sheth
A. Das
VLM
MLLM
46
1
0
16 Nov 2024
A Transformer-Based Visual Piano Transcription Algorithm
Uros Zivanovic
Carlos Eduardo Cancino-Chacón
ViT
26
0
0
13 Nov 2024
CityGuessr: City-Level Video Geo-Localization on a Global Scale
P. Kulkarni
Gaurav Kumar Nayak
Mubarak Shah
ViT
AI4TS
29
2
0
10 Nov 2024
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Rohan Choudhury
Guanglei Zhu
Sihan Liu
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
26
9
0
07 Nov 2024
HourVideo: 1-Hour Video-Language Understanding
Keshigeyan Chandrasegaran
Agrim Gupta
Lea M. Hadzic
Taran Kota
Jimming He
Cristobal Eyzaguirre
Zane Durante
Manling Li
Jiajun Wu
L. Fei-Fei
VLM
41
31
0
07 Nov 2024
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
Wenhao Wang
Y. Yang
VGen
45
3
0
05 Nov 2024
Can Transformers Smell Like Humans?
Farzaneh Taleb
Miguel Vasco
Antônio H. Ribeiro
Marten Bjorkman
Danica Kragic
AI4CE
ViT
35
0
0
05 Nov 2024
Continual Audio-Visual Sound Separation
Weiguo Pian
Yiyang Nan
Shijian Deng
Shentong Mo
Yunhui Guo
Yapeng Tian
VLM
CLL
43
0
0
05 Nov 2024
AM Flow: Adapters for Temporal Processing in Action Recognition
Tanay Agrawal
Abid Ali
A. Dantcheva
François Brémond
39
0
0
04 Nov 2024
ROAD-Waymo: Action Awareness at Scale for Autonomous Driving
Salman Khan
Izzeddin Teeti
Reza Javanmard Alitappeh
Mihaela C. Stoian
Eleonora Giunchiglia
Gurkirt Singh
Andrew Bradley
Fabio Cuzzolin
40
0
0
03 Nov 2024
HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation
Zirui Wang
Xinran Zhao
Simon Stepputtis
Woojun Kim
Tongshuang Wu
Katia P. Sycara
Yaqi Xie
OffRL
49
0
0
03 Nov 2024
NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs
Nursena Köprücü
Destiny Okpekpe
Antonio Orvieto
Mamba
36
1
0
31 Oct 2024
Learning Video Representations without Natural Videos
Xueyang Yu
Xinlei Chen
Yossi Gandelsman
VGen
AI4TS
49
0
0
31 Oct 2024
Sparsh: Self-supervised touch representations for vision-based tactile sensing
Carolina Higuera
Akash Sharma
Chaithanya Krishna Bodduluri
Taosha Fan
Patrick E. Lancaster
...
Michael Kaess
Byron Boots
Mike Lambeta
Tingfan Wu
Mustafa Mukadam
34
11
0
31 Oct 2024
MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption
Ruixun Liu
Kaiyu Li
Jiayi Song
Dongwei Sun
Xiangyong Cao
VGen
35
1
0
31 Oct 2024
EchoFM: Foundation Model for Generalizable Echocardiogram Analysis
Sekeun Kim
Pengfei Jin
S. Song
Cheng Chen
Yiwei Li
Hui Ren
Xiang Li
Tianming Liu
Quanzheng Li
39
0
0
30 Oct 2024
Revisiting MAE pre-training for 3D medical image segmentation
Tassilo Wald
Constantin Ulrich
Stanislav Lukyanenko
Andrei Goncharov
Alberto Paderno
Leander Maerkisch
Paul F. Jäger
Klaus Maier-Hein
42
2
0
30 Oct 2024
NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction
Z. Gong
Guangyin Bao
Qi Zhang
Zhongwei Wan
Duoqian Miao
...
Changwei Wang
Rongtao Xu
Liang Hu
Ke Liu
Yu Zhang
DiffM
VGen
51
8
0
25 Oct 2024
Are Visual-Language Models Effective in Action Recognition? A Comparative Study
Mahmoud Ali
Di Yang
François Brémond
VLM
51
0
0
22 Oct 2024
Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios
Jeaung Lee
Jeewoo Lim
Keunho Byeon
Jin Tae Kwak
40
3
0
21 Oct 2024
Upsampling DINOv2 features for unsupervised vision tasks and weakly supervised materials segmentation
Ronan Docherty
Antonis Vamvakeros
Samuel J. Cooper
37
1
0
20 Oct 2024
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
Ruiqi Li
Siqi Zheng
Xize Cheng
Ziang Zhang
Shengpeng Ji
Zhou Zhao
VGen
63
7
0
16 Oct 2024