BEVT: BERT Pretraining of Video Transformers

2 December 2021
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Yu-Gang Jiang
Luowei Zhou
Lu Yuan
    ViT

Papers citing "BEVT: BERT Pretraining of Video Transformers"

50 / 147 papers shown
Title
Self-distilled Masked Attention guided masked image modeling with noise Regularized Teacher (SMART) for medical image analysis
Jue Jiang
Aneesh Rangnekar
Chloe Min Seo Choi
H. Veeraraghavan
MedIm
16
0
0
02 Oct 2023
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
Zhiwu Qing
Shiwei Zhang
Ziyuan Huang
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
19
18
0
14 Sep 2023
Interpretability-Aware Vision Transformer
Yao Qiang
Chengyin Li
Prashant Khanduri
D. Zhu
ViT
80
7
0
14 Sep 2023
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers
J. Denize
Mykola Liashuha
Jaonary Rabarisoa
Astrid Orcesi
Romain Hérault
ViT
15
13
0
03 Sep 2023
Self-Supervised Video Transformers for Isolated Sign Language Recognition
Marcelo Sandoval-Castaneda
Yanhong Li
D. Brentari
Karen Livescu
Gregory Shakhnarovich
SLR
15
2
0
02 Sep 2023
Motion-Guided Masking for Spatiotemporal Representation Learning
D. Fan
Jue Wang
Shuai Liao
Yi Zhu
Vimal Bhat
H. Santos-Villalobos
M. Rohith
Xinyu Li
VGen
24
19
0
24 Aug 2023
MOFO: MOtion FOcused Self-Supervision for Video Understanding
Mona Ahmadian
Frank Guerin
Andrew Gilbert
34
2
0
23 Aug 2023
Towards Privacy-Supporting Fall Detection via Deep Unsupervised RGB2Depth Adaptation
Hejun Xiao
Kunyu Peng
Xiangsheng Huang
Alina Roitberg
Hao Li
Zhao Wang
Rainer Stiefelhagen
18
3
0
23 Aug 2023
MGMAE: Motion Guided Masking for Video Masked Autoencoding
Bingkun Huang
Zhiyu Zhao
Guozhen Zhang
Yu Qiao
Limin Wang
26
30
0
21 Aug 2023
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos
Zhiqiang Shen
Xiaoxiao Sheng
Hehe Fan
Longguang Wang
Y. Guo
Qiong Liu
Hao-Kai Wen
Xiaoping Zhou
3DPC
15
14
0
18 Aug 2023
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
Yi Bin
Haoxuan Li
Yahui Xu
Xing Xu
Yang Yang
Heng Tao Shen
VOS
24
18
0
08 Aug 2023
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
Qi Zhao
Shijie Wang
Ce Zhang
Changcheng Fu
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
LM&Ro
46
49
0
31 Jul 2023
Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Syed Talal Wasim
Muhammad Uzair Khattak
Muzammal Naseer
Salman Khan
M. Shah
F. Khan
ViT
46
19
0
13 Jul 2023
A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision
Zhonghan Zhao
Wenhao Chai
Shengyu Hao
Wenhao Hu
Guanhong Wang
Shidong Cao
Min-Gyoo Song
Jenq-Neng Hwang
Gaoang Wang
27
17
0
07 Jul 2023
Contrastive Predictive Autoencoders for Dynamic Point Cloud Self-Supervised Learning
Xiaoxiao Sheng
Zhiqiang Shen
Gang Xiao
3DPC
SSL
28
6
0
22 May 2023
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval
Han Fang
Zhifei Yang
Xianghao Zang
Chao Ban
Hao Sun
VGen
24
2
0
13 May 2023
ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
34
841
0
09 May 2023
PointCMP: Contrastive Mask Prediction for Self-supervised Learning on Point Cloud Videos
Zhiqiang Shen
Xiaoxiao Sheng
Longguang Wang
Y. Guo
Qiong Liu
Xiaoping Zhou
3DPC
SSL
20
14
0
06 May 2023
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
Junke Wang
Dongdong Chen
Chong Luo
Xiyang Dai
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
93
54
0
27 Apr 2023
Efficient Multimodal Fusion via Interactive Prompting
Yaowei Li
Ruijie Quan
Linchao Zhu
Yezhou Yang
28
44
0
13 Apr 2023
Hard Patches Mining for Masked Image Modeling
Haochen Wang
Kaiyou Song
Junsong Fan
Yuxi Wang
Jin Xie
Zhaoxiang Zhang
29
59
0
12 Apr 2023
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
J. Liu
34
7
0
09 Apr 2023
Self-Supervised Video Similarity Learning
Giorgos Kordopatis-Zilos
Giorgos Tolias
Christos Tzelepis
I. Kompatsiaris
Ioannis Patras
Symeon Papadopoulos
SSL
29
8
0
06 Apr 2023
SVT: Supertoken Video Transformer for Efficient Video Understanding
Chen-Ming Pan
Rui Hou
Hanchao Yu
Qifan Wang
Senem Velipasalar
Madian Khabsa
ViT
21
0
0
01 Apr 2023
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Ximeng Sun
Pengchuan Zhang
Peizhao Zhang
Hardik Shah
Kate Saenko
Xide Xia
VLM
18
20
0
31 Mar 2023
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Limin Wang
Bingkun Huang
Zhiyu Zhao
Zhan Tong
Yinan He
Yi Wang
Yali Wang
Yu Qiao
VGen
54
325
0
29 Mar 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
43
154
0
28 Mar 2023
Learning Expressive Prompting With Residuals for Vision Transformers
Rajshekhar Das
Yonatan Dukler
Avinash Ravichandran
A. Swaminathan
VLM
VPVLM
25
21
0
27 Mar 2023
Continuous Intermediate Token Learning with Implicit Motion Manifold for Keyframe Based Motion Interpolation
Clinton Mo
Kun Hu
Chengjiang Long
Zhiyong Wang
27
12
0
27 Mar 2023
Towards Scalable Neural Representation for Diverse Videos
Bo He
Xitong Yang
Hanyu Wang
Zuxuan Wu
Hao Chen
Shuaiyi Huang
Yixuan Ren
Ser-Nam Lim
Abhinav Shrivastava
49
41
0
24 Mar 2023
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders
J. Hernandez
Ruben Villegas
Vicente Ordonez
SSL
31
2
0
21 Mar 2023
Remote Sensing Scene Classification with Masked Image Modeling (MIM)
Liya Wang
A. Tien
32
3
0
28 Feb 2023
Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations
Ziyu Jiang
Yinpeng Chen
Mengchen Liu
Dongdong Chen
Xiyang Dai
Lu Yuan
Zicheng Liu
Zhangyang Wang
SSL
VLM
CLIP
32
16
0
27 Feb 2023
AIM: Adapting Image Models for Efficient Video Action Recognition
Taojiannan Yang
Yi Zhu
Yusheng Xie
Aston Zhang
C. L. P. Chen
Mu Li
ViT
44
144
0
06 Feb 2023
Representation Deficiency in Masked Language Modeling
Yu Meng
Jitin Krishnan
Sinong Wang
Qifan Wang
Yuning Mao
Han Fang
Marjan Ghazvininejad
Jiawei Han
Luke Zettlemoyer
79
7
0
04 Feb 2023
Aerial Image Object Detection With Vision Transformer Detector (ViTDet)
Liya Wang
A. Tien
39
7
0
28 Jan 2023
CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition
Cheng Lu
Xiaojie Jin
Zhicheng Huang
Qibin Hou
Ming-Ming Cheng
Jiashi Feng
35
8
0
15 Jan 2023
A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends
Jie Gui
Tuo Chen
Jing Zhang
Qiong Cao
Zhe Sun
Haoran Luo
Dacheng Tao
29
122
0
13 Jan 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
78
36
0
05 Jan 2023
Swin MAE: Masked Autoencoders for Small Datasets
Zi'an Xu
Yin Dai
Fayu Liu
Weibin Chen
Yue Liu
Li-Li Shi
Sheng Liu
Yuhang Zhou
SyDa
MedIm
ViT
28
28
0
28 Dec 2022
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning
J. Denize
Jaonary Rabarisoa
Astrid Orcesi
Romain Hérault
SSL
14
6
0
21 Dec 2022
Masked Event Modeling: Self-Supervised Pretraining for Event Cameras
Simone Klenk
David Bonello
Lukas Koestler
Nikita Araslanov
Daniel Cremers
21
22
0
20 Dec 2022
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
Xiaoyi Dong
Jianmin Bao
Ting Zhang
Dongdong Chen
Shuyang Gu
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
22
35
0
12 Dec 2022
Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu
Eduardo Fonseca
Radu Tudor Ionescu
Mario Lucic
Cordelia Schmid
Anurag Arnab
SSL
32
43
0
09 Dec 2022
Deep Architectures for Content Moderation and Movie Content Rating
Fatih Çagatay Akyön
A. Temizel
28
4
0
08 Dec 2022
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Lu Yuan
Yu-Gang Jiang
VGen
27
87
0
08 Dec 2022
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
Zhentao Yu
Zixin Yin
Deyu Zhou
Duomin Wang
Finn Wong
Baoyuan Wang
DiffM
22
35
0
07 Dec 2022
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
31
54
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
38
309
0
06 Dec 2022
Prototypical Residual Networks for Anomaly Detection and Localization
H. Zhang
Zuxuan Wu
Z. Wang
Zhineng Chen
Yuwei Jiang
UQCV
AI4TS
35
62
0
05 Dec 2022