ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.11178
  4. Cited By
VATT: Transformers for Multimodal Self-Supervised Learning from Raw
  Video, Audio and Text
v1v2v3 (latest)

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
    ViT
ArXiv (abs)PDFHTML

Papers citing "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"

50 / 360 papers shown
Title
Audio Self-supervised Learning: A Survey
Audio Self-supervised Learning: A Survey
Shuo Liu
Adria Mallol-Ragolta
Emilia Parada-Cabeleiro
Kun Qian
Xingshuo Jing
Alexander Kathan
Bin Hu
Bjoern W. Schuller
SSL
100
109
0
02 Mar 2022
Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luís Vilacca
Yi Yu
Paula Viana
88
5
0
28 Feb 2022
HiP: Hierarchical Perceiver
HiP: Hierarchical Perceiver
João Carreira
Skanda Koppula
Daniel Zoran
Adrià Recasens
Catalin Ionescu
...
M. Botvinick
Oriol Vinyals
Karen Simonyan
Andrew Zisserman
Andrew Jaegle
VLM
116
14
0
22 Feb 2022
VLP: A Survey on Vision-Language Pre-training
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
181
227
0
18 Feb 2022
Auxiliary Cross-Modal Representation Learning with Triplet Loss
  Functions for Online Handwriting Recognition
Auxiliary Cross-Modal Representation Learning with Triplet Loss Functions for Online Handwriting Recognition
Felix Ott
David Rügamer
Lucas Heublein
Bernd Bischl
Christopher Mutschler
127
10
0
16 Feb 2022
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with
  Omni Retrieval
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
Licheng Yu
Jun Chen
Animesh Sinha
Mengjiao MJ Wang
Hugo Chen
Tamara L. Berg
Ning Zhang
VLM
93
39
0
15 Feb 2022
Visual Acoustic Matching
Visual Acoustic Matching
Changan Chen
Ruohan Gao
P. Calamia
Kristen Grauman
71
58
0
14 Feb 2022
data2vec: A General Framework for Self-supervised Learning in Speech,
  Vision and Language
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Alexei Baevski
Wei-Ning Hsu
Qiantong Xu
Arun Babu
Jiatao Gu
Michael Auli
SSLVLMViT
123
863
0
07 Feb 2022
Self-supervised Graphs for Audio Representation Learning with Limited
  Labeled Data
Self-supervised Graphs for Audio Representation Learning with Limited Labeled Data
A. Shirian
Krishna Somandepalli
T. Guha
SSL
132
10
0
31 Jan 2022
Multimodal data matters: language model pre-training over structured and
  unstructured electronic health records
Multimodal data matters: language model pre-training over structured and unstructured electronic health records
Sicen Liu
Xiaolong Wang
Yongshuai Hou
Ge Li
Hui Wang
Huiqin Xu
Yang Xiang
Buzhou Tang
113
33
0
25 Jan 2022
Omnivore: A Single Model for Many Visual Modalities
Omnivore: A Single Model for Many Visual Modalities
Rohit Girdhar
Mannat Singh
Nikhil Ravi
Laurens van der Maaten
Armand Joulin
Ishan Misra
286
237
0
20 Jan 2022
TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval
TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval
Yue Ruan
Han-Hung Lee
Yiming Zhang
Ke Zhang
Angel X. Chang
95
22
0
19 Jan 2022
Video Transformers: A Survey
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
141
107
0
16 Jan 2022
Bridging Video-text Retrieval with Multiple Choice Questions
Bridging Video-text Retrieval with Multiple Choice Questions
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
BDL
90
109
0
13 Jan 2022
Multiview Transformers for Video Recognition
Multiview Transformers for Video Recognition
Shen Yan
Xuehan Xiong
Anurag Arnab
Zhichao Lu
Mi Zhang
Chen Sun
Cordelia Schmid
ViT
97
221
0
12 Jan 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and
  Sound
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
111
215
0
07 Jan 2022
Progressive Video Summarization via Multimodal Self-supervised Learning
Progressive Video Summarization via Multimodal Self-supervised Learning
Haopeng Li
Qiuhong Ke
Mingming Gong
Tom Drummond
AI4TS
64
19
0
07 Jan 2022
Connecting the Dots between Audio and Text without Parallel Data through
  Visual Knowledge Transfer
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
Yanpeng Zhao
Jack Hessel
Youngjae Yu
Ximing Lu
Rowan Zellers
Yejin Choi
118
27
0
16 Dec 2021
Co-training Transformer with Videos and Images Improves Action
  Recognition
Co-training Transformer with Videos and Images Improves Action Recognition
Bowen Zhang
Jiahui Yu
Christopher Fifty
Wei Han
Andrew M. Dai
Ruoming Pang
Fei Sha
ViT
83
54
0
14 Dec 2021
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on
  Unpaired Images and Text
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
Qing Li
Boqing Gong
Huayu Chen
Dan Kondratyuk
Xianzhi Du
Ming-Hsuan Yang
Matthew A. Brown
ViT
49
17
0
14 Dec 2021
Contextualized Spatio-Temporal Contrastive Learning with
  Self-Supervision
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision
Liangzhe Yuan
Rui Qian
Huayu Chen
Boqing Gong
Florian Schroff
Ming-Hsuan Yang
Hartwig Adam
Ting Liu
AI4TS
105
16
0
09 Dec 2021
Exploring Temporal Granularity in Self-Supervised Video Representation
  Learning
Exploring Temporal Granularity in Self-Supervised Video Representation Learning
Rui Qian
Yeqing Li
Liangzhe Yuan
Boqing Gong
Ting Liu
Matthew A. Brown
Serge Belongie
Ming-Hsuan Yang
Hartwig Adam
Huayu Chen
AI4TS
92
6
0
08 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David Harwath
James R. Glass
Hilde Kuehne
ViT
129
134
0
08 Dec 2021
BEVT: BERT Pretraining of Video Transformers
BEVT: BERT Pretraining of Video Transformers
Rui Wang
Dongdong Chen
Zuxuan Wu
Yinpeng Chen
Xiyang Dai
Mengchen Liu
Yu-Gang Jiang
Luowei Zhou
Lu Yuan
ViT
108
209
0
02 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception
  for Zero-shot and Few-shot Tasks
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Hongsheng Li
Xiaohua Wang
Jifeng Dai
124
133
0
02 Dec 2021
SWAT: Spatial Structure Within and Among Tokens
SWAT: Spatial Structure Within and Among Tokens
Kumara Kahatapitiya
Michael S. Ryoo
72
6
0
26 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
107
75
0
25 Nov 2021
V2C: Visual Voice Cloning
V2C: Visual Voice Cloning
Qi Chen
Yuanqing Li
Yuankai Qi
Jiaqiu Zhou
Mingkui Tan
Qi Wu
VGen
72
27
0
25 Nov 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
  Modeling
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
146
221
0
24 Nov 2021
Florence: A New Foundation Model for Computer Vision
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
167
907
0
22 Nov 2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang
Xiaowei Hu
Zhe Gan
Zhengyuan Yang
Xiyang Dai
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
73
57
0
19 Nov 2021
Scaling Law for Recommendation Models: Towards General-purpose User
  Representations
Scaling Law for Recommendation Models: Towards General-purpose User Representations
Kyuyong Shin
Hanock Kwak
KyungHyun Kim
Max Nihlén Ramström
Jisu Jeong
Jung-Woo Ha
Seon Gyeom Kim
ELM
110
42
0
15 Nov 2021
Multimodal End-to-End Group Emotion Recognition using Cross-Modal
  Attention
Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention
Lev Evtodienko
46
5
0
10 Nov 2021
Machine Learning for Multimodal Electronic Health Records-based
  Research: Challenges and Perspectives
Machine Learning for Multimodal Electronic Health Records-based Research: Challenges and Perspectives
Ziyi Liu
Jiaqi Zhang
Yongshuai Hou
Xinran Zhang
Ge Li
Yang Xiang
96
14
0
09 Nov 2021
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal
  Emotion Recognition
MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition
Jinming Zhao
Ruichen Li
Qin Jin
Xinchao Wang
Haizhou Li
49
25
0
27 Oct 2021
SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text
  Joint Pre-Training
SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
Ankur Bapna
Yu-An Chung
Na Wu
Anmol Gulati
Ye Jia
J. Clark
Melvin Johnson
Jason Riesa
Alexis Conneau
Yu Zhang
VLM
137
96
0
20 Oct 2021
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks
Sangeeta Srivastava
Yun Wang
Andros Tjandra
Anurag Kumar
Chunxi Liu
Kritika Singh
Yatharth Saraf
SSL
99
25
0
14 Oct 2021
Multi-Modal Pre-Training for Automated Speech Recognition
Multi-Modal Pre-Training for Automated Speech Recognition
David M. Chan
Shalini Ghosh
D. Chakrabarty
Björn Hoffmeister
SSL
92
16
0
12 Oct 2021
Supervision Exists Everywhere: A Data Efficient Contrastive
  Language-Image Pre-training Paradigm
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm
Yangguang Li
Feng Liang
Lichen Zhao
Yufeng Cui
Wanli Ouyang
Jing Shao
F. Yu
Junjie Yan
VLMCLIP
165
458
0
11 Oct 2021
Self-Supervised Video Representation Learning by Video Incoherence
  Detection
Self-Supervised Video Representation Learning by Video Incoherence Detection
Haozhi Cao
Yuecong Xu
Jianfei Yang
K. Mao
Lihua Xie
Jianxiong Yin
Simon See
SSL
52
6
0
26 Sep 2021
Survey: Transformer based Video-Language Pre-training
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLMViT
125
45
0
21 Sep 2021
Multilingual Molecular Representation Learning via Contrastive
  Pre-training
Multilingual Molecular Representation Learning via Contrastive Pre-training
Zhihui Guo
P. Sharma
Andy Martinez
Liang Du
Robin Abraham
77
30
0
18 Sep 2021
Overview of Tencent Multi-modal Ads Video Understanding Challenge
Overview of Tencent Multi-modal Ads Video Understanding Challenge
Zhenzhi Wang
Liyu Wu
Zhimin Li
Jiangfeng Xiong
Qinglin Lu
58
4
0
16 Sep 2021
Cross-lingual Transfer of Monolingual Models
Cross-lingual Transfer of Monolingual Models
Evangelia Gogoulou
Ariel Ekgren
T. Isbister
Magnus Sahlgren
88
18
0
15 Sep 2021
Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Xuefan Zha
Wentao Zhu
Tingxun Lv
Sen Yang
Ji Liu
AI4TSViT
86
27
0
26 Aug 2021
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Andrew Jaegle
Sebastian Borgeaud
Jean-Baptiste Alayrac
Carl Doersch
Catalin Ionescu
...
Olivier J. Hénaff
M. Botvinick
Andrew Zisserman
Oriol Vinyals
João Carreira
MLLMVLMGNN
133
585
0
30 Jul 2021
Multimodal Co-learning: Challenges, Applications with Datasets, Recent
  Advances and Future Directions
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
Anil Rahate
Rahee Walambe
S. Ramanna
K. Kotecha
105
143
0
29 Jul 2021
Learning Vision-Guided Quadrupedal Locomotion End-to-End with
  Cross-Modal Transformers
Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers
Ruihan Yang
Minghao Zhang
Nicklas Hansen
Huazhe Xu
Xiaolong Wang
OffRL
82
108
0
08 Jul 2021
Attention Bottlenecks for Multimodal Fusion
Attention Bottlenecks for Multimodal Fusion
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
114
574
0
30 Jun 2021
AudioCLIP: Extending CLIP to Image, Text and Audio
AudioCLIP: Extending CLIP to Image, Text and Audio
A. Guzhov
Federico Raue
Jörn Hees
Andreas Dengel
CLIPVLM
130
373
0
24 Jun 2021
Previous
12345678
Next