arXiv:2104.11178, v3 (latest)
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
ViT
Papers citing
"VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"
50 / 360 papers shown
Title
Zero-Shot Video Captioning with Evolving Pseudo-Tokens
Yoad Tewel
Yoav Shalev
Roy Nadler
Idan Schwartz
Lior Wolf
61
27
0
22 Jul 2022
GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning
Huseyin Coskun
Alireza Zareian
Joshua L. Moore
F. Tombari
Chen Wang
SSL
98
3
0
20 Jul 2022
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
Xiaoping Han
Licheng Yu
Xiatian Zhu
Li Zhang
Yi-Zhe Song
Tao Xiang
AI4TS
49
49
0
17 Jul 2022
LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training
Sumanth Gurram
An Fang
David M. Chan
John F. Canny
VLM
AI4TS
67
1
0
16 Jul 2022
Visually-aware Acoustic Event Detection using Heterogeneous Graphs
A. Shirian
Krishna Somandepalli
Victor Sanchez
T. Guha
61
3
0
16 Jul 2022
Building Korean Sign Language Augmentation (KoSLA) Corpus with Data Augmentation Technique
Changnam An
Eunkyung Han
Dongmyeong Noh
O. Kwon
Sumi Lee
H. Han
SLR
40
1
0
12 Jul 2022
Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays
Yan Han
G. Holste
Ying Ding
Ahmed H. Tewfik
Yifan Peng
Zhangyang Wang
LM&MA
ViT
124
16
0
10 Jul 2022
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
Madeline Chantry Schiappa
Shruti Vyas
Hamid Palangi
Yogesh S Rawat
Vibhav Vineet
VLM
162
20
0
05 Jul 2022
GraphVid: It Only Takes a Few Nodes to Understand a Video
Eitan Kosman
Dotan Di Castro
GNN
83
5
0
04 Jul 2022
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Junting Pan
Ziyi Lin
Xiatian Zhu
Jing Shao
Hongsheng Li
96
206
0
27 Jun 2022
AFT-VO: Asynchronous Fusion Transformers for Multi-View Visual Odometry Estimation
Nimet Kaygusuz
Oscar Alejandro Mendez Maldonado
Richard Bowden
77
5
0
26 Jun 2022
SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos
S. H. Khorasgani
Yuxuan Chen
Florian Shkurti
SSL
114
24
0
25 Jun 2022
Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching
Nicola Messina
D. Coccomini
Andrea Esuli
Fabrizio Falchi
26
6
0
21 Jun 2022
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Yogesh S Rawat
M. Shah
SSL
128
136
0
18 Jun 2022
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Linxi Fan
Guanzhi Wang
Yunfan Jiang
Ajay Mandlekar
Yuncong Yang
Haoyi Zhu
Andrew Tang
De-An Huang
Yuke Zhu
Anima Anandkumar
LM&Ro
148
388
0
17 Jun 2022
iBoot: Image-bootstrapped Self-Supervised Video Representation Learning
F. Saleh
Fuwen Tan
Adrian Bulat
Georgios Tzimiropoulos
Brais Martínez
SSL
94
1
0
16 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
90
84
0
14 Jun 2022
It's Time for Artistic Correspondence in Music and Video
Dídac Surís
Carl Vondrick
Bryan C. Russell
Justin Salamon
64
37
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
Peng Xu
Xiatian Zhu
David Clifton
ViT
233
575
0
13 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Hongsheng Li
Xiaogang Wang
Jifeng Dai
MoMe
MoE
93
70
0
09 Jun 2022
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Basil Mustafa
C. Riquelme
J. Puigcerver
Rodolphe Jenatton
N. Houlsby
VLM
MoE
170
205
0
06 Jun 2022
Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data
Shohreh Deldari
Hao Xue
Aaqib Saeed
Jiayuan He
Daniel V. Smith
Flora D. Salim
AI4TS
75
37
0
06 Jun 2022
Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN
Siyuan Li
Di Wu
Fang Wu
Lei Shang
Stan Z. Li
84
49
0
27 May 2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li
Haiyang Xu
Junfeng Tian
Wei Wang
Ming Yan
...
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
Luo Si
VLM
MLLM
93
224
0
24 May 2022
Non-Parametric Domain Adaptation for End-to-End Speech Translation
Yichao Du
Weizhi Wang
Zhirui Zhang
Boxing Chen
Tong Xu
Jun Xie
Enhong Chen
136
18
0
23 May 2022
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Zhenhailong Wang
Manling Li
Ruochen Xu
Luowei Zhou
Jie Lei
...
Chenguang Zhu
Derek Hoiem
Shih-Fu Chang
Joey Tianyi Zhou
Heng Ji
MLLM
VLM
225
142
0
22 May 2022
Contrastive Learning with Cross-Modal Knowledge Mining for Multimodal Human Activity Recognition
Razvan Brinzea
Bulat Khaertdinov
S. Asteriadis
SSL
HAI
92
13
0
20 May 2022
TransTab: Learning Transferable Tabular Transformers Across Tables
Zifeng Wang
Jimeng Sun
LMTD
85
151
0
19 May 2022
Multimodal Conversational AI: A Survey of Datasets and Approaches
Anirudh S. Sundar
Larry Heck
102
30
0
13 May 2022
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code
Yong Dai
Duyu Tang
Liangxin Liu
Minghuan Tan
Cong Zhou
Jingquan Wang
Zhangyin Feng
Fan Zhang
Xueyu Hu
Shuming Shi
VLM
MoE
83
26
0
12 May 2022
CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
220
1,311
0
04 May 2022
i-Code: An Integrative and Composable Multimodal Learning Framework
Ziyi Yang
Yuwei Fang
Chenguang Zhu
Reid Pryzant
DongDong Chen
...
Bin Xiao
Yuanxun Lu
Takuya Yoshioka
Michael Zeng
Xuedong Huang
107
49
0
03 May 2022
Where in the World is this Image? Transformer-based Geo-localization in the Wild
Shraman Pramanick
E. Nowara
Joshua Gleason
Carlos D. Castillo
Rama Chellappa
ViT
60
37
0
29 Apr 2022
Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities
Jiandian Zeng
Tianyi Liu
Jiantao Zhou
151
63
0
28 Apr 2022
Pseudo strong labels for large scale weakly supervised audio tagging
Heinrich Dinkel
Zhiyong Yan
Yongqing Wang
Junbo Zhang
Yujun Wang
61
6
0
28 Apr 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
81
44
0
26 Apr 2022
UAMD-Net: A Unified Adaptive Multimodal Neural Network for Dense Depth Completion
Guancheng Chen
Jun-Ming Lin
Huabiao Qin
3DPC
45
8
0
16 Apr 2022
Are Multimodal Transformers Robust to Missing Modality?
Mengmeng Ma
Jian Ren
Long Zhao
Davide Testuggine
Xi Peng
ViT
111
155
0
12 Apr 2022
Representation Learning by Detecting Incorrect Location Embeddings
Sepehr Sameni
Simon Jenni
Paolo Favaro
ViT
62
5
0
10 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
136
43
0
06 Apr 2022
MultiMAE: Multi-modal Multi-task Masked Autoencoders
Roman Bachmann
David Mizrahi
Andrei Atanov
Amir Zamir
142
278
0
04 Apr 2022
Deformable Video Transformer
Jue Wang
Lorenzo Torresani
ViT
98
28
0
31 Mar 2022
Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models
Feng Cheng
Ming Xu
Yuanjun Xiong
Hao Chen
Xinyu Li
Wei Li
Wei Xia
63
17
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
111
95
0
30 Mar 2022
Multimodal Pre-training Based on Graph Attention Network for Document Understanding
Zhenrong Zhang
Jiefeng Ma
Jun Du
Licheng Wang
Jianshu Zhang
55
38
0
25 Mar 2022
Self-supervised Video-centralised Transformer for Video Face Clustering
Yujiang Wang
Mingzhi Dong
Jie Shen
Yi-Si Luo
Yiming Lin
Pingchuan Ma
Stavros Petridis
Maja Pantic
ViT
69
3
0
24 Mar 2022
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition
Thanh-Dat Truong
Quoc-Huy Bui
C. Duong
Han-Seok Seo
Son Lam Phung
Xin Li
Khoa Luu
ViT
116
51
0
19 Mar 2022
All in One: Exploring Unified Video-Language Pre-training
Alex Jinpeng Wang
Yixiao Ge
Rui Yan
Yuying Ge
Xudong Lin
Guanyu Cai
Jianping Wu
Ying Shan
Xiaohu Qie
Mike Zheng Shou
111
202
0
14 Mar 2022
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Saghir Alfasly
Jian Lu
C. Xu
Yuru Zou
101
19
0
06 Mar 2022
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Jeffrey Tsaw
Yudong Liu
Shentong Mo
Dani Yogatama
Louis-Philippe Morency
Ruslan Salakhutdinov
90
33
0
02 Mar 2022