ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2104.11178
  4. Cited By
VATT: Transformers for Multimodal Self-Supervised Learning from Raw
  Video, Audio and Text

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
    ViT
ArXivPDFHTML

Papers citing "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"

50 / 131 papers shown
Title
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
SungHeon Jeong
Jihong Park
Mohsen Imani
59
0
0
05 May 2025
Learning Streaming Video Representation via Multitask Training
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
84
0
0
28 Apr 2025
Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition
Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition
Alexander Brettmann
Jakob Grävinghoff
Marlene Rüschoff
Marie Westhues
SLR
56
0
0
10 Apr 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
Santhosh Malarvannan
84
0
0
20 Feb 2025
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
45
0
0
18 Nov 2024
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Yunze Man
Shuhong Zheng
Zhipeng Bao
M. Hebert
Liang-Yan Gui
Yu-xiong Wang
75
15
0
05 Sep 2024
ICSD: An Open-source Dataset for Infant Cry and Snoring Detection
ICSD: An Open-source Dataset for Infant Cry and Snoring Detection
Qingyu Liu
Longfei Song
Dongxing Xu
Yanhua Long
45
0
0
20 Aug 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Xu Tan
Qifeng Chen
Y. Guo
VGen
102
16
0
06 Jun 2024
LookHere: Vision Transformers with Directed Attention Generalize and
  Extrapolate
LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
A. Fuller
Daniel G. Kyrollos
Yousef Yassin
James R. Green
52
2
0
22 May 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual
  Question Answering
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
45
1
0
13 May 2024
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition
Yash Jain
David M. Chan
Pranav Dheram
Aparna Khare
Olabanji Shonibare
Venkatesh Ravichandran
Shalini Ghosh
40
2
0
28 Mar 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
37
5
0
28 Mar 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
43
29
0
20 Feb 2024
Audio-Visual LLM for Video Understanding
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
27
38
0
11 Dec 2023
OmniVec: Learning robust representations with cross modal sharing
OmniVec: Learning robust representations with cross modal sharing
Siddharth Srivastava
Gaurav Sharma
SSL
29
64
0
07 Nov 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
25
9
0
25 Oct 2023
AGS: An Dataset and Taxonomy for Domestic Scene Sound Event Recognition
AGS: An Dataset and Taxonomy for Domestic Scene Sound Event Recognition
Nan Che
Chenrui Liu
Fei Yu
33
0
0
30 Aug 2023
AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
Zhaohui Li
Haitao Wang
Xinghua Jiang
40
1
0
14 Aug 2023
Cascaded Cross-Modal Transformer for Request and Complaint Detection
Cascaded Cross-Modal Transformer for Request and Complaint Detection
Nicolae-Cătălin Ristea
Radu Tudor Ionescu
33
3
0
27 Jul 2023
Language-based Action Concept Spaces Improve Video Self-Supervised
  Learning
Language-based Action Concept Spaces Improve Video Self-Supervised Learning
Kanchana Ranasinghe
Michael S. Ryoo
SSL
VLM
37
12
0
20 Jul 2023
Does Visual Pretraining Help End-to-End Reasoning?
Does Visual Pretraining Help End-to-End Reasoning?
Chen Sun
Calvin Luo
Xingyi Zhou
Anurag Arnab
Cordelia Schmid
OCL
LRM
ViT
35
3
0
17 Jul 2023
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
Chunhui Zhang
Xin Sun
Li Liu
Yiqian Yang
Qiong Liu
Xiaoping Zhou
Yanfeng Wang
46
15
0
07 Jul 2023
A Unified Framework for Slot based Response Generation in a Multimodal
  Dialogue System
A Unified Framework for Slot based Response Generation in a Multimodal Dialogue System
Mauajama Firdaus
Avinash Madasu
Asif Ekbal
44
7
0
27 May 2023
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Chia-Wen Kuo
Z. Kira
34
21
0
25 May 2023
More Perspectives Mean Better: Underwater Target Recognition and
  Localization with Multimodal Data via Symbiotic Transformer and Multiview
  Regression
More Perspectives Mean Better: Underwater Target Recognition and Localization with Multimodal Data via Symbiotic Transformer and Multiview Regression
Shipei Liu
Xiaoya Fan
Guowei Wu
33
0
0
22 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language,
  and Speech Data
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
17
2
0
21 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited
  Modalities
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
45
115
0
18 May 2023
A vector quantized masked autoencoder for audiovisual speech emotion recognition
A vector quantized masked autoencoder for audiovisual speech emotion recognition
Samir Sadok
Simon Leglaive
Renaud Séguier
SSL
79
6
0
05 May 2023
On the Opportunities and Challenges of Foundation Models for Geospatial
  Artificial Intelligence
On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence
Gengchen Mai
Weiming Huang
Jin Sun
Suhang Song
Deepak Mishra
...
Yingjie Hu
Chris Cundy
Ziyuan Li
Rui Zhu
Ni Lao
AI4CE
32
123
0
13 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
25
37
0
09 Apr 2023
HypLiLoc: Towards Effective LiDAR Pose Regression with Hyperbolic Fusion
HypLiLoc: Towards Effective LiDAR Pose Regression with Hyperbolic Fusion
Sijie Wang
Qiyu Kang
Rui She
Wei Wang
K. Zhao
Yang Song
Wee Peng Tay
31
18
0
03 Apr 2023
Procedure-Aware Pretraining for Instructional Video Understanding
Procedure-Aware Pretraining for Instructional Video Understanding
Honglu Zhou
Roberto Martín-Martín
Mubbasir Kapadia
Silvio Savarese
Juan Carlos Niebles
28
38
0
31 Mar 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in
  Untrimmed Multi-Action Videos from Narrated Instructions
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
40
7
0
29 Mar 2023
Egocentric Auditory Attention Localization in Conversations
Egocentric Auditory Attention Localization in Conversations
Fiona Ryan
Hao Jiang
Abhinav Shukla
James M. Rehg
V. Ithapu
EgoV
29
16
0
28 Mar 2023
3Mformer: Multi-order Multi-mode Transformer for Skeletal Action
  Recognition
3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition
Lei Wang
Piotr Koniusz
ViT
28
45
0
25 Mar 2023
Transformers in Speech Processing: A Survey
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Junaid Qadir
42
47
0
21 Mar 2023
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet
  Tag-guided Synthetic Data
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Xuenan Xu
Zhiling Zhang
Zelin Zhou
Pingyue Zhang
Zeyu Xie
Mengyue Wu
Ke Zhu
CLIP
71
14
0
14 Mar 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
Accommodating Audio Modality in CLIP for Multimodal Processing
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
24
10
0
12 Mar 2023
Heterogeneous Graph Learning for Acoustic Event Classification
Heterogeneous Graph Learning for Acoustic Event Classification
A. Shirian
Mona Ahmadian
Krishna Somandepalli
T. Guha
25
2
0
05 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense
  Video Captioning
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
39
221
0
27 Feb 2023
Transformadores: Fundamentos teoricos y Aplicaciones
Transformadores: Fundamentos teoricos y Aplicaciones
J. D. L. Torre
78
0
0
18 Feb 2023
Audio-Visual Contrastive Learning with Temporal Self-Supervision
Audio-Visual Contrastive Learning with Temporal Self-Supervision
Simon Jenni
Alexander Black
John Collomosse
SSL
28
15
0
15 Feb 2023
Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic
  Data Imputation
Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic Data Imputation
Haifang Wen
Wenzhuo Tang
Wei Jin
Jiayuan Ding
Renming Liu
Xinnan Dai
Feng Shi
Lulu Shang
Jiliang Tang
Yuying Xie
29
8
0
06 Feb 2023
What You Say Is What You Show: Visual Narration Detection in
  Instructional Videos
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
24
4
0
05 Jan 2023
UnICLAM:Contrastive Representation Learning with Adversarial Masking for
  Unified and Interpretable Medical Vision Question Answering
UnICLAM:Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering
Chenlu Zhan
Peng Peng
Hongsen Wang
Tao Chen
Hongwei Wang
MedIm
23
3
0
21 Dec 2022
Efficient Self-supervised Learning with Contextualized Target
  Representations for Vision, Speech and Language
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Alexei Baevski
Arun Babu
Wei-Ning Hsu
Michael Auli
VLM
SSL
32
92
0
14 Dec 2022
Multimodal Vision Transformers with Forced Attention for Behavior
  Analysis
Multimodal Vision Transformers with Forced Attention for Behavior Analysis
Tanay Agrawal
Michal Balazia
Philippe Muller
Franccois Brémond
ViT
23
9
0
07 Dec 2022
See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation
See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation
Hao Li
Yizhi Zhang
Junzhe Zhu
Shaoxiong Wang
Michelle A. Lee
Huazhe Xu
Edward H. Adelson
Li Fei-Fei
Ruohan Gao
Jiajun Wu
32
58
0
07 Dec 2022
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video
  Learning
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
36
54
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and
  Discriminative Learning
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
57
309
0
06 Dec 2022
123
Next