2104.11178
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
ViT
Papers citing
"VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"
50 / 360 papers shown
Title
On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval
Seongbo Jang
Seonghyeon Lee
Dongha Lee
Hwanjo Yu
10
0
0
13 Jun 2025
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
Sarthak Kumar Maharana
Saksham Singh Kushwaha
Baoming Zhang
Adrian Rodriguez
Songtao Wei
Yapeng Tian
Yunhui Guo
TTA
VLM
23
0
0
31 May 2025
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
A. Fragomeni
Dima Damen
Michael Wray
24
0
0
29 May 2025
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
SungHeon Jeong
Jihong Park
Mohsen Imani
187
0
0
05 May 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
164
1
0
28 Apr 2025
Symbolic Representation for Any-to-Any Generative Tasks
Jianfei Chen
Xiaoye Zhu
Yanjie Wang
Tianyang Liu
Xinhui Chen
...
Yifei Ke
Qingbin Liu
Yiwen Yuan
Julian McAuley
Li Li
DiffM
78
0
0
24 Apr 2025
Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective
Yuling Jiao
Yanming Lai
Yang Wang
Bokai Yan
62
0
0
18 Apr 2025
A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search
Tinh-Anh Nguyen-Nhu
H. Tran
Nguyen-Khang Le
Minh-Nhat Nguyen
T. Nguyen
...
Huu-Phong Phan-Nguyen
Huy-Thach Pham
Quan Nguyen
Hoang M. Le
Quang-Vinh Dinh
99
0
0
12 Apr 2025
Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition
Alexander Brettmann
Jakob Grävinghoff
Marlene Rüschoff
Marie Westhues
SLR
86
0
0
10 Apr 2025
Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification
Jiahang Li
Shibo Xue
Yong Su
75
0
0
08 Apr 2025
Continual Cross-Modal Generalization
Yan Xia
Hai Huang
Minghui Fang
Zhou Zhao
CLL
101
0
0
01 Apr 2025
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Hanyu Zhou
Gim Hee Lee
74
0
0
10 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
122
4
0
01 Mar 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
Santhosh Malarvannan
142
0
0
20 Feb 2025
Simpler Fast Vision Transformers with a Jumbo CLS Token
A. Fuller
Yousef Yassin
Daniel G. Kyrollos
Evan Shelhamer
James R. Green
203
0
0
20 Feb 2025
Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
Timo Fudala
Vasileios Tsouvalas
N. Meratnia
MoE
90
0
0
10 Feb 2025
CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets
Tanay Agrawal
Mohammed Guermal
Michal Balazia
François Brémond
68
0
0
08 Jan 2025
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing
Inpyo Hong
Youngwan Jo
Hyojeong Lee
Sunghyun Ahn
Sanghyun Park
MQ
118
2
0
26 Dec 2024
A Concept-Centric Approach to Multi-Modality Learning
Yuchong Geng
Ao Tang
158
0
0
18 Dec 2024
LLMs are Also Effective Embedding Models: An In-depth Overview
Chongyang Tao
Tao Shen
Shen Gao
Junshuo Zhang
Zhen Li
Zhengwei Tao
Shuai Ma
143
11
0
17 Dec 2024
CrossVIT-augmented Geospatial-Intelligence Visualization System for Tracking Economic Development Dynamics
Yanbing Bai
Jinhua Su
Bin Qiao
Xiaoran Ma
117
0
0
13 Dec 2024
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
Kim Sung-Bin
Arda Senocak
Hyunwoo Ha
Tae-Hyun Oh
DiffM
219
0
0
09 Dec 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luis Vilaca
Yi Yu
Paula Vinan
186
0
0
24 Nov 2024
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
Alvi Md Ishmam
Christopher Thomas
AAML
180
3
0
23 Nov 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
175
0
0
18 Nov 2024
Multi-Modal interpretable automatic video captioning
Antoine Hanna-Asaad
Decky Aspandi
Titus Zaharia
65
0
0
11 Nov 2024
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Rohan Choudhury
Guanglei Zhu
Sihan Liu
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
83
14
0
07 Nov 2024
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
A. Haliassos
Rodrigo Mira
Honglie Chen
Zoe Landgraf
Stavros Petridis
Maja Pantic
SSL
86
7
0
04 Nov 2024
Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
A. Saporta
A. Puli
Mark Goldstein
Rajesh Ranganath
SSL
85
1
0
01 Nov 2024
Video Token Merging for Long-form Video Understanding
Seon-Ho Lee
Jue Wang
Zhikang Zhang
D. Fan
Xinyu Li
92
6
0
31 Oct 2024
Multimodal Learning for Embryo Viability Prediction in Clinical IVF
Junsik Kim
Zhiyi Shi
Davin Jeong
Johannes Knittel
H. Yang
...
Wanhua Li
Yicong Li
D. Ben-Yosef
D. Needleman
Hanspeter Pfister
96
2
0
21 Oct 2024
OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities
Lawrence Yunliang Chen
Hexiang Hu
Ruotong Wang
Yiran Chen
Zifeng Wang
...
Pranav Shyam
Tianyi Zhou
Heng-Chiao Huang
Ming-Hsuan Yang
Boqing Gong
38
3
0
16 Oct 2024
On-the-fly Modulation for Balanced Multimodal Learning
Yake Wei
D. Hu
Henghui Du
Ji-Rong Wen
52
11
0
15 Oct 2024
Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders
Wenjing Gao
Yuanyuan Yang
Jianrui Wei
Xuntao Yin
Xinhan Di
49
0
0
07 Oct 2024
DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
Sungnyun Kim
Haofu Liao
Srikar Appalaraju
Peng Tang
Zhuowen Tu
R. Satzoda
R. Manmatha
Vijay Mahadevan
Stefano Soatto
104
0
0
04 Oct 2024
Survival Prediction in Lung Cancer through Multi-Modal Representation Learning
Aiman Farooq
Deepak Mishra
S. Chaudhury
62
2
0
30 Sep 2024
Beyond Redundancy: Information-aware Unsupervised Multiplex Graph Structure Learning
Zhixiang Shen
Shuo Wang
Zhao Kang
128
4
0
25 Sep 2024
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Bilal Faye
Hanane Azzag
M. Lebbah
ObjD
105
0
0
17 Sep 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Zehan Zhu
Lin Gui
Ruichu Cai
Erik Cambria
Hasti Seifi
105
6
0
11 Sep 2024
PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation
Ginger Delmas
Philippe Weinzaepfel
Francesc Moreno-Noguer
Grégory Rogez
65
2
0
10 Sep 2024
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Yunze Man
Shuhong Zheng
Zhipeng Bao
M. Hebert
Liang-Yan Gui
Yu-Xiong Wang
140
23
0
05 Sep 2024
When Heterophily Meets Heterogeneous Graphs: Latent Graphs Guided Unsupervised Representation Learning
Zhixiang Shen
Zhao Kang
88
4
0
01 Sep 2024
ICSD: An Open-source Dataset for Infant Cry and Snoring Detection
Qingyu Liu
Longfei Song
Dongxing Xu
Yanhua Long
90
0
0
20 Aug 2024
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Junjie He
Yifeng Geng
Liefeng Bo
DiffM
117
23
0
12 Aug 2024
Cross-Modality Clustering-based Self-Labeling for Multimodal Data Classification
P. Zyblewski
Leandro L. Minku
73
0
0
05 Aug 2024
Exploring Robust Face-Voice Matching in Multilingual Environments
Jiehui Tang
Xiaofei Wang
Zhen Xiao
Jiayi Liu
Xueliang Liu
Richang Hong
CVBM
83
0
0
29 Jul 2024
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language
Yuchen Zhang
Ratish Kumar Chandrakant Jha
Soumya Bharadwaj
Vatsal Sanjaykumar Thakkar
Adrienne Hoarfrost
Jin Sun
55
1
0
21 Jul 2024
VideoMamba: Spatio-Temporal Selective State Space Model
Jinyoung Park
Hee-Seon Kim
Kangwook Ko
Minbeom Kim
Changick Kim
Mamba
124
9
0
11 Jul 2024
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen
Yichao Du
Zichen Wen
Yiyang Zhou
Chenhang Cui
...
Jiawei Zhou
Zhuokai Zhao
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
EGVM
MLLM
117
35
0
05 Jul 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
139
6
0
04 Jul 2024