ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.11178
  4. Cited By
VATT: Transformers for Multimodal Self-Supervised Learning from Raw
  Video, Audio and Text
v1v2v3 (latest)

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
    ViT
ArXiv (abs)PDFHTML

Papers citing "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"

50 / 360 papers shown
Title
Alternating Gradient Descent and Mixture-of-Experts for Integrated
  Multimodal Perception
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Hassan Akbari
Dan Kondratyuk
Huayu Chen
Rachel Hornung
Haoran Wang
Hartwig Adam
VLMMoE
103
13
0
10 May 2023
Listen to Look into the Future: Audio-Visual Egocentric Gaze
  Anticipation
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
Bolin Lai
Fiona Ryan
Wenqi Jia
Miao Liu
James M. Rehg
EgoV
90
8
0
06 May 2023
A vector quantized masked autoencoder for audiovisual speech emotion recognition
A vector quantized masked autoencoder for audiovisual speech emotion recognition
Samir Sadok
Simon Leglaive
Renaud Séguier
SSL
177
6
0
05 May 2023
Learning Missing Modal Electronic Health Records with Unified
  Multi-modal Data Embedding and Modality-Aware Attention
Learning Missing Modal Electronic Health Records with Unified Multi-modal Data Embedding and Modality-Aware Attention
Kwanhyung Lee
Soojeong Lee
Sangchul Hahn
Heejung Hyun
Edward Choi
Byungeun Ahn
Joohyung Lee
86
18
0
04 May 2023
An Empirical Study of Multimodal Model Merging
An Empirical Study of Multimodal Model Merging
Yi-Lin Sung
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Joey Tianyi Zhou
Lijuan Wang
MoMe
115
42
0
28 Apr 2023
ChatVideo: A Tracklet-centric Multimodal and Versatile Video
  Understanding System
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
Junke Wang
Dongdong Chen
Chong Luo
Xiyang Dai
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
168
57
0
27 Apr 2023
Implicit Temporal Modeling with Learnable Alignment for Video
  Recognition
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
S. Tu
Qi Dai
Zuxuan Wu
Zhi-Qi Cheng
Hang-Rui Hu
Yu-Gang Jiang
109
37
0
20 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jinhui Tang
VLM
133
112
0
17 Apr 2023
On the Opportunities and Challenges of Foundation Models for Geospatial
  Artificial Intelligence
On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence
Gengchen Mai
Weiming Huang
Jin Sun
Suhang Song
Deepak Mishra
...
Yingjie Hu
Chris Cundy
Ziyuan Li
Rui Zhu
Ni Lao
AI4CE
122
134
0
13 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal
  representations
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
59
4
0
11 Apr 2023
On Robustness in Multimodal Learning
On Robustness in Multimodal Learning
Brandon McKinzie
Joseph Cheng
Vaishaal Shankar
Yinfei Yang
Jonathon Shlens
Alexander Toshev
59
2
0
10 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
111
38
0
09 Apr 2023
Beyond Unimodal: Generalising Neural Processes for Multimodal
  Uncertainty Estimation
Beyond Unimodal: Generalising Neural Processes for Multimodal Uncertainty Estimation
M. Jung
He Zhao
Joanna Dipnall
Lan Du
UQCVBDL
69
8
0
04 Apr 2023
HypLiLoc: Towards Effective LiDAR Pose Regression with Hyperbolic Fusion
HypLiLoc: Towards Effective LiDAR Pose Regression with Hyperbolic Fusion
Sijie Wang
Qiyu Kang
Rui She
Wei Wang
K. Zhao
Yang Song
Wee Peng Tay
105
18
0
03 Apr 2023
Procedure-Aware Pretraining for Instructional Video Understanding
Procedure-Aware Pretraining for Instructional Video Understanding
Honglu Zhou
Roberto Martín-Martín
Mubbasir Kapadia
Silvio Savarese
Juan Carlos Niebles
123
40
0
31 Mar 2023
Self-Supervised Multimodal Learning: A Survey
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
125
50
0
31 Mar 2023
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
Kim Sung-Bin
Arda Senocak
H. Ha
Andrew Owens
Tae-Hyun Oh
DiffMVGen
86
39
0
30 Mar 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in
  Untrimmed Multi-Action Videos from Narrated Instructions
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
112
7
0
29 Mar 2023
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Reuben Tan
Arijit Ray
Andrea Burns
Bryan A. Plummer
Justin Salamon
Oriol Nieto
Bryan C. Russell
Kate Saenko
88
22
0
28 Mar 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial
  Grounding
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
55
0
0
28 Mar 2023
Egocentric Auditory Attention Localization in Conversations
Egocentric Auditory Attention Localization in Conversations
Fiona Ryan
Hao Jiang
Abhinav Shukla
James M. Rehg
V. Ithapu
EgoV
70
16
0
28 Mar 2023
3Mformer: Multi-order Multi-mode Transformer for Skeletal Action
  Recognition
3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition
Lei Wang
Piotr Koniusz
ViT
88
50
0
25 Mar 2023
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation
  Models
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Dohwan Ko
Joon-Young Choi
Hyeong Kyu Choi
Kyoung-Woon On
Byungseok Roh
Hyunwoo J. Kim
121
22
0
23 Mar 2023
Transformers in Speech Processing: A Survey
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Muhammad Usama
Junaid Qadir
165
48
0
21 Mar 2023
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet
  Tag-guided Synthetic Data
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Xuenan Xu
Zhiling Zhang
Zelin Zhou
Pingyue Zhang
Zeyu Xie
Mengyue Wu
Ke Zhu
CLIP
164
15
0
14 Mar 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
Accommodating Audio Modality in CLIP for Multimodal Processing
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
73
10
0
12 Mar 2023
Heterogeneous Graph Learning for Acoustic Event Classification
Heterogeneous Graph Learning for Acoustic Event Classification
A. Shirian
Mona Ahmadian
Krishna Somandepalli
T. Guha
71
2
0
05 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense
  Video Captioning
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TSVLM
173
241
0
27 Feb 2023
Localizing Moments in Long Video Via Multimodal Guidance
Localizing Moments in Long Video Via Multimodal Guidance
Wayner Barrios
Mattia Soldan
Alberto M. Ceballos-Arroyo
Fabian Caba Heilbron
Guohao Li
89
21
0
26 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CEVLM
148
214
0
20 Feb 2023
Transformadores: Fundamentos teoricos y Aplicaciones
Transformadores: Fundamentos teoricos y Aplicaciones
J. D. L. Torre
169
0
0
18 Feb 2023
Audio-Visual Contrastive Learning with Temporal Self-Supervision
Audio-Visual Contrastive Learning with Temporal Self-Supervision
Simon Jenni
Alexander Black
John Collomosse
SSL
77
16
0
15 Feb 2023
A dataset for Audio-Visual Sound Event Detection in Movies
A dataset for Audio-Visual Sound Event Detection in Movies
Rajat Hebbar
Digbalay Bose
Krishna Somandepalli
Veena Vijai
Shrikanth Narayanan
47
9
0
14 Feb 2023
CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets
CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets
Jiang Yang
Sheng Guo
Gangshan Wu
Limin Wang
VLM
58
7
0
13 Feb 2023
Policy-Induced Self-Supervision Improves Representation Finetuning in
  Visual RL
Policy-Induced Self-Supervision Improves Representation Finetuning in Visual RL
Sébastien M. R. Arnold
Fei Sha
SSL
46
0
0
12 Feb 2023
Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face
  Anti-Spoofing
Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing
Zitong Yu
Rizhao Cai
Yawen Cui
Xin Liu
Yongjian Hu
Alex C. Kot
58
25
0
11 Feb 2023
AV-data2vec: Self-supervised Learning of Audio-Visual Speech
  Representations with Contextualized Target Representations
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Jiachen Lian
Alexei Baevski
Wei-Ning Hsu
Michael Auli
SSL
150
34
0
10 Feb 2023
SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor
  Segmentation in PET/CT Images
SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor Segmentation in PET/CT Images
Gary Y. Li
Junyu Chen
Se-In Jang
Kuang Gong
Quanzheng Li
ViTMedIm
87
14
0
08 Feb 2023
Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic
  Data Imputation
Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic Data Imputation
Haifang Wen
Wenzhuo Tang
Wei Jin
Jiayuan Ding
Renming Liu
Xinnan Dai
Feng Shi
Lulu Shang
Jiliang Tang
Yuying Xie
66
10
0
06 Feb 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image
  and Video
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLMVLMMoE
116
171
0
01 Feb 2023
Multimodality Representation Learning: A Survey on Evolution,
  Pretraining and Its Applications
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications
Muhammad Arslan Manzoor
S. Albarri
Ziting Xian
Zaiqiao Meng
Preslav Nakov
Shangsong Liang
AI4TS
101
32
0
01 Feb 2023
Zorro: the masked multimodal transformer
Zorro: the masked multimodal transformer
Adrià Recasens
Jason Lin
João Carreira
Drew Jaegle
Luyu Wang
...
Pauline Luc
Antoine Miech
Lucas Smaira
Ross Hemsley
Andrew Zisserman
92
21
0
23 Jan 2023
What You Say Is What You Show: Visual Narration Detection in
  Instructional Videos
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
105
4
0
05 Jan 2023
Learning by Sorting: Self-supervised Learning with Group Ordering
  Constraints
Learning by Sorting: Self-supervised Learning with Group Ordering Constraints
Nina Shvetsova
Felix Petersen
Anna Kukleva
Bernt Schiele
Hilde Kuehne
SSL
102
13
0
05 Jan 2023
Generating music with sentiment using Transformer-GANs
Generating music with sentiment using Transformer-GANs
Pedro Neves
José Fornari
J. Florindo
MGen
52
22
0
21 Dec 2022
UnICLAM:Contrastive Representation Learning with Adversarial Masking for
  Unified and Interpretable Medical Vision Question Answering
UnICLAM:Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering
Chenlu Zhan
Peng Peng
Hongsen Wang
Tao Chen
Hongwei Wang
MedIm
77
4
0
21 Dec 2022
CLIPPO: Image-and-Language Understanding from Pixels Only
CLIPPO: Image-and-Language Understanding from Pixels Only
Michael Tschannen
Basil Mustafa
N. Houlsby
CLIPVLM
102
49
0
15 Dec 2022
FlexiViT: One Model for All Patch Sizes
FlexiViT: One Model for All Patch Sizes
Lucas Beyer
Pavel Izmailov
Alexander Kolesnikov
Mathilde Caron
Simon Kornblith
Xiaohua Zhai
Matthias Minderer
Michael Tschannen
Ibrahim Alabdulmohsin
Filip Pavetić
VLM
153
94
0
15 Dec 2022
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin
Yi-Lin Sung
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
108
78
0
15 Dec 2022
Efficient Self-supervised Learning with Contextualized Target
  Representations for Vision, Speech and Language
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Alexei Baevski
Arun Babu
Wei-Ning Hsu
Michael Auli
VLMSSL
125
97
0
14 Dec 2022
Previous
12345678
Next