Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2112.01529
Cited By
BEVT: BERT Pretraining of Video Transformers
2 December 2021
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Yu-Gang Jiang
Luowei Zhou
Lu Yuan
ViT
Re-assign community
ArXiv
PDF
HTML
Papers citing
"BEVT: BERT Pretraining of Video Transformers"
47 / 147 papers shown
Title
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Fangxun Shu
Biaolong Chen
Yue Liao
Shuwen Xiao
Wenyu Sun
Xiaobo Li
Yousong Zhu
Jinqiao Wang
Si Liu
CLIP
25
11
0
02 Dec 2022
Spatio-Temporal Crop Aggregation for Video Representation Learning
Sepehr Sameni
Simon Jenni
Paolo Favaro
24
3
0
30 Nov 2022
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation
Huaishao Luo
Junwei Bao
Youzheng Wu
Xiaodong He
Tianrui Li
VLM
32
144
0
27 Nov 2022
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning
Pritam Sarkar
Ali Etemad
19
21
0
25 Nov 2022
SVFormer: Semi-supervised Video Transformer for Action Recognition
Zhen Xing
Qi Dai
Hang-Rui Hu
Jingjing Chen
Zuxuan Wu
Yu-Gang Jiang
ViT
33
69
0
23 Nov 2022
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training
Yuanze Lin
Chen Wei
Huiyu Wang
Alan Yuille
Cihang Xie
3DGS
34
15
0
21 Nov 2022
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders
W. G. C. Bandara
Naman Patel
A. Gholami
Mehdi Nikkhah
M. Agrawal
Vishal M. Patel
25
39
0
16 Nov 2022
Masked Motion Encoding for Self-Supervised Video Representation Learning
Xinyu Sun
Peihao Chen
Liang-Chieh Chen
Chan Li
Thomas H. Li
Mingkui Tan
Chuang Gan
27
29
0
12 Oct 2022
It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training
Yuxin Song
Min Yang
Wenhao Wu
Dongliang He
Fu Li
Jingdong Wang
ViT
97
8
0
11 Oct 2022
Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders
Haosen Yang
Deng Huang
Bin Wen
Jiannan Wu
H. Yao
Yi-Xin Jiang
Xiatian Zhu
Zehuan Yuan
37
19
0
09 Oct 2022
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM
VLM
35
148
0
15 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
25
68
0
14 Sep 2022
Spatial-Temporal Transformer for Video Snapshot Compressive Imaging
Lishun Wang
Miao Cao
Yong Zhong
Xin Yuan
19
47
0
04 Sep 2022
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Tsu-jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
William Yang Wang
Lijuan Wang
Zicheng Liu
VLM
26
64
0
04 Sep 2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Xiaoyi Dong
Jianmin Bao
Yinglin Zheng
Ting Zhang
Dongdong Chen
...
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
VLM
54
158
0
25 Aug 2022
Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling
Rui Wang
Zuxuan Wu
Dongdong Chen
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Luowei Zhou
Lu Yuan
Yu-Gang Jiang
ViT
40
4
0
25 Aug 2022
A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond
Chaoning Zhang
Chenshuang Zhang
Junha Song
John Seon Keun Yi
Kang Zhang
In So Kweon
SSL
57
71
0
30 Jul 2022
MAR: Masked Autoencoders for Efficient Action Recognition
Zhiwu Qing
Shiwei Zhang
Ziyuan Huang
Xiang Wang
Yuehuang Wang
Yiliang Lv
Changxin Gao
Nong Sang
32
42
0
24 Jul 2022
Bootstrapped Masked Autoencoders for Vision BERT Pretraining
Xiaoyi Dong
Jianmin Bao
Ting Zhang
Dongdong Chen
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
22
75
0
14 Jul 2022
Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning
Boeun Kim
H. Chang
Jungho Kim
J. Choi
ViT
13
48
0
13 Jul 2022
Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning
Ting Yao
Yingwei Pan
Yehao Li
Chong-Wah Ngo
Tao Mei
ViT
154
137
0
11 Jul 2022
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Jiangning Zhang
Xiangtai Li
Yabiao Wang
Chengjie Wang
Yibo Yang
Yong Liu
Dacheng Tao
ViT
34
32
0
19 Jun 2022
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Y. S. Rawat
M. Shah
SSL
36
131
0
18 Jun 2022
OmniMAE: Single Model Masked Pretraining on Images and Videos
Rohit Girdhar
Alaaeldin El-Nouby
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
ViT
37
97
0
16 Jun 2022
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
Lingchen Meng
Xiyang Dai
Yinpeng Chen
Pengchuan Zhang
Dongdong Chen
Mengchen Liu
Jianfeng Wang
Zuxuan Wu
Lu Yuan
Yu-Gang Jiang
ObjD
16
22
0
07 Jun 2022
A Survey on Video Action Recognition in Sports: Datasets, Methods and Applications
Fei Wu
Qingzhong Wang
Jian Bian
Haoyi Xiong
Ning Ding
Feixiang Lu
Junqing Cheng
Dejing Dou
AI4TS
26
52
0
02 Jun 2022
Reduce Information Loss in Transformers for Pluralistic Image Inpainting
Qiankun Liu
Zhentao Tan
Dongdong Chen
Qi Chu
Xiyang Dai
Yinpeng Chen
Mengchen Liu
Lu Yuan
Nenghai Yu
ViT
31
70
0
10 May 2022
i-Code: An Integrative and Composable Multimodal Learning Framework
Ziyi Yang
Yuwei Fang
Chenguang Zhu
Reid Pryzant
Dongdong Chen
...
Bin Xiao
Yuanxun Lu
Takuya Yoshioka
Michael Zeng
Xuedong Huang
40
45
0
03 May 2022
Bringing Old Films Back to Life
Bo Liu
Bo Zhang
Dongdong Chen
Jing Liao
ViT
VGen
19
42
0
31 Mar 2022
ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization
Bo He
Xitong Yang
Le Kang
Zhiyu Cheng
Xingfa Zhou
Abhinav Shrivastava
33
77
0
29 Mar 2022
ObjectFormer for Image Manipulation Detection and Localization
Junke Wang
Zuxuan Wu
Jingjing Chen
Xintong Han
Abhinav Shrivastava
Ser-Nam Lim
Yu-Gang Jiang
37
108
0
28 Mar 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
137
1,129
0
23 Mar 2022
Visual Prompt Tuning
Menglin Jia
Luming Tang
Bor-Chun Chen
Claire Cardie
Serge Belongie
Bharath Hariharan
Ser-Nam Lim
VLM
VPVLM
40
1,527
0
23 Mar 2022
Protecting Celebrities from DeepFake with Identity Consistency Transformer
Xiaoyi Dong
Jianmin Bao
Dongdong Chen
Ting Zhang
Weiming Zhang
Nenghai Yu
Dong Chen
Fang Wen
B. Guo
ViT
51
120
0
02 Mar 2022
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
3DGS
36
38
0
20 Jan 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
22
103
0
16 Jan 2022
AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
Lingchen Meng
Hengduo Li
Bor-Chun Chen
Shiyi Lan
Zuxuan Wu
Yu-Gang Jiang
Ser-Nam Lim
ViT
25
219
0
30 Nov 2021
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
Xiaoyi Dong
Jianmin Bao
Ting Zhang
Dongdong Chen
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
Baining Guo
ViT
48
238
0
24 Nov 2021
Efficient Video Transformers with Spatial-Temporal Token Selection
Junke Wang
Xitong Yang
Hengduo Li
Li Liu
Zuxuan Wu
Yu-Gang Jiang
ViT
21
63
0
23 Nov 2021
Semi-Supervised Vision Transformers
Zejia Weng
Xitong Yang
Ang Li
Zuxuan Wu
Yu-Gang Jiang
ViT
17
40
0
22 Nov 2021
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
308
7,443
0
11 Nov 2021
Mobile-Former: Bridging MobileNet and Transformer
Yinpeng Chen
Xiyang Dai
Dongdong Chen
Mengchen Liu
Xiaoyi Dong
Lu Yuan
Zicheng Liu
ViT
183
476
0
12 Aug 2021
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
338
5,785
0
29 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
ViT
251
577
0
22 Apr 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
280
1,982
0
09 Feb 2021
Video Transformer Network
Daniel Neimark
Omri Bar
Maya Zohar
Dotan Asselmann
ViT
204
422
0
01 Feb 2021
ECO: Efficient Convolutional Network for Online Video Understanding
Mohammadreza Zolfaghari
Kamaljeet Singh
Thomas Brox
136
496
0
24 Apr 2018
Previous
1
2
3