VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
    ViT

Papers citing "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"

50 / 360 papers shown
Title
On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval
Seongbo Jang
Seonghyeon Lee
Dongha Lee
Hwanjo Yu
10
0
0
13 Jun 2025
$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time
Sarthak Kumar Maharana
Saksham Singh Kushwaha
Baoming Zhang
Adrian Rodriguez
Songtao Wei
Yapeng Tian
Yunhui Guo
TTA VLM
23
0
0
31 May 2025
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
A. Fragomeni
Dima Damen
Michael Wray
24
0
0
29 May 2025
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
SungHeon Jeong
Jihong Park
Mohsen Imani
187
0
0
05 May 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
164
1
0
28 Apr 2025
Symbolic Representation for Any-to-Any Generative Tasks
Jianfei Chen
Xiaoye Zhu
Yanjie Wang
Tianyang Liu
Xinhui Chen
...
Yifei Ke
Qingbin Liu
Yiwen Yuan
Julian McAuley
Li Li
DiffM
78
0
0
24 Apr 2025
Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective
Yuling Jiao
Yanming Lai
Yang Wang
Bokai Yan
62
0
0
18 Apr 2025
A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search
Tinh-Anh Nguyen-Nhu
H. Tran
Nguyen-Khang Le
Minh-Nhat Nguyen
T. Nguyen
...
Huu-Phong Phan-Nguyen
Huy-Thach Pham
Quan Nguyen
Hoang M. Le
Quang-Vinh Dinh
99
0
0
12 Apr 2025
Breaking the Barriers: Video Vision Transformers for Word-Level Sign Language Recognition
Alexander Brettmann
Jakob Grävinghoff
Marlene Rüschoff
Marie Westhues
SLR
86
0
0
10 Apr 2025
Gaze-Guided Learning: Avoiding Shortcut Bias in Visual Classification
Jiahang Li
Shibo Xue
Yong Su
75
0
0
08 Apr 2025
Continual Cross-Modal Generalization
Yan Xia
Hai Huang
Minghui Fang
Zhou Zhao
CLL
101
0
0
01 Apr 2025
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Hanyu Zhou
Gim Hee Lee
74
0
0
10 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
122
4
0
01 Mar 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
Santhosh Malarvannan
142
0
0
20 Feb 2025
Simpler Fast Vision Transformers with a Jumbo CLS Token
A. Fuller
Yousef Yassin
Daniel G. Kyrollos
Evan Shelhamer
James R. Green
203
0
0
20 Feb 2025
Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
Timo Fudala
Vasileios Tsouvalas
N. Meratnia
MoE
90
0
0
10 Feb 2025
CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets
Tanay Agrawal
Mohammed Guermal
Michal Balazia
François Brémond
68
0
0
08 Jan 2025
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing
Inpyo Hong
Youngwan Jo
Hyojeong Lee
Sunghyun Ahn
Sanghyun Park
MQ
118
2
0
26 Dec 2024
A Concept-Centric Approach to Multi-Modality Learning
Yuchong Geng
Ao Tang
158
0
0
18 Dec 2024
LLMs are Also Effective Embedding Models: An In-depth Overview
Chongyang Tao
Tao Shen
Shen Gao
Junshuo Zhang
Zhen Li
Zhengwei Tao
Shuai Ma
143
11
0
17 Dec 2024
CrossVIT-augmented Geospatial-Intelligence Visualization System for Tracking Economic Development Dynamics
Yanbing Bai
Jinhua Su
Bin Qiao
Xiaoran Ma
117
0
0
13 Dec 2024
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
Kim Sung-Bin
Arda Senocak
Hyunwoo Ha
Tae-Hyun Oh
DiffM
219
0
0
09 Dec 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luis Vilaca
Yi Yu
Paula Vinan
186
0
0
24 Nov 2024
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
Alvi Md Ishmam
Christopher Thomas
AAML
180
3
0
23 Nov 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
175
0
0
18 Nov 2024
Multi-Modal interpretable automatic video captioning
Antoine Hanna-Asaad
Decky Aspandi
Titus Zaharia
65
0
0
11 Nov 2024
Don't Look Twice: Faster Video Transformers with Run-Length Tokenization
Rohan Choudhury
Guanglei Zhu
Sihan Liu
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
83
14
0
07 Nov 2024
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
A. Haliassos
Rodrigo Mira
Honglie Chen
Zoe Landgraf
Stavros Petridis
Maja Pantic
SSL
86
7
0
04 Nov 2024
Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
A. Saporta
A. Puli
Mark Goldstein
Rajesh Ranganath
SSL
85
1
0
01 Nov 2024
Video Token Merging for Long-form Video Understanding
Seon-Ho Lee
Jue Wang
Zhikang Zhang
D. Fan
Xinyu Li
92
6
0
31 Oct 2024
Multimodal Learning for Embryo Viability Prediction in Clinical IVF
Junsik Kim
Zhiyi Shi
Davin Jeong
Johannes Knittel
H. Yang
...
Wanhua Li
Yicong Li
D. Ben-Yosef
D. Needleman
Hanspeter Pfister
96
2
0
21 Oct 2024
OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities
Lawrence Yunliang Chen
Hexiang Hu
Ruotong Wang
Yiran Chen
Zifeng Wang
...
Pranav Shyam
Tianyi Zhou
Heng-Chiao Huang
Ming-Hsuan Yang
Boqing Gong
38
3
0
16 Oct 2024
On-the-fly Modulation for Balanced Multimodal Learning
Yake Wei
D. Hu
Henghui Du
Ji-Rong Wen
52
11
0
15 Oct 2024
Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders
Wenjing Gao
Yuanyuan Yang
Jianrui Wei
Xuntao Yin
Xinhan Di
49
0
0
07 Oct 2024
DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
Sungnyun Kim
Haofu Liao
Srikar Appalaraju
Peng Tang
Zhuowen Tu
R. Satzoda
R. Manmatha
Vijay Mahadevan
Stefano Soatto
104
0
0
04 Oct 2024
Survival Prediction in Lung Cancer through Multi-Modal Representation Learning
Aiman Farooq
Deepak Mishra
S. Chaudhury
62
2
0
30 Sep 2024
Beyond Redundancy: Information-aware Unsupervised Multiplex Graph Structure Learning
Zhixiang Shen
Shuo Wang
Zhao Kang
128
4
0
25 Sep 2024
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Bilal Faye
Hanane Azzag
M. Lebbah
ObjD
105
0
0
17 Sep 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Zehan Zhu
Lin Gui
Ruichu Cai
Erik Cambria
Hasti Seifi
105
6
0
11 Sep 2024
PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation
Ginger Delmas
Philippe Weinzaepfel
Francesc Moreno-Noguer
Grégory Rogez
65
2
0
10 Sep 2024
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Yunze Man
Shuhong Zheng
Zhipeng Bao
M. Hebert
Liang-Yan Gui
Yu-Xiong Wang
140
23
0
05 Sep 2024
When Heterophily Meets Heterogeneous Graphs: Latent Graphs Guided Unsupervised Representation Learning
Zhixiang Shen
Zhao Kang
88
4
0
01 Sep 2024
ICSD: An Open-source Dataset for Infant Cry and Snoring Detection
Qingyu Liu
Longfei Song
Dongxing Xu
Yanhua Long
90
0
0
20 Aug 2024
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Junjie He
Yifeng Geng
Liefeng Bo
DiffM
117
23
0
12 Aug 2024
Cross-Modality Clustering-based Self-Labeling for Multimodal Data Classification
P. Zyblewski
Leandro L. Minku
73
0
0
05 Aug 2024
Exploring Robust Face-Voice Matching in Multilingual Environments
Jiehui Tang
Xiaofei Wang
Zhen Xiao
Jiayi Liu
Xueliang Liu
Richang Hong
CVBM
83
0
0
29 Jul 2024
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language
Yuchen Zhang
Ratish Kumar Chandrakant Jha
Soumya Bharadwaj
Vatsal Sanjaykumar Thakkar
Adrienne Hoarfrost
Jin Sun
55
1
0
21 Jul 2024
VideoMamba: Spatio-Temporal Selective State Space Model
Jinyoung Park
Hee-Seon Kim
Kangwook Ko
Minbeom Kim
Changick Kim
Mamba
124
9
0
11 Jul 2024
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen
Yichao Du
Zichen Wen
Yiyang Zhou
Chenhang Cui
...
Jiawei Zhou
Zhuokai Zhao
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
EGVM MLLM
117
35
0
05 Jul 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
139
6
0
04 Jul 2024