arXiv: 2104.11178
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
    ViT

Papers citing "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"

50 / 360 papers shown
Title
Semantically Guided Representation Learning For Action Anticipation
Anxhelo Diko
D. Avola
Bardh Prenkaj
Federico Fontana
Luigi Cinque
AI4TS
59
3
0
02 Jul 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM, LRM
82
1
0
13 Jun 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei
Mahsa Shafaei
Yigeng Zhang
Hugo Jair Escalante
Thamar Solorio
79
0
0
12 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
170
13
1
09 Jun 2024
Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities
Sai Munikoti
Ian Stewart
Sameera Horawalavithana
Henry Kvinge
Tegan H. Emerson
Sandra E Thompson
Karl Pazdernik
102
2
0
08 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
Yu Guo
VGen
275
17
0
06 Jun 2024
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Mehmet Hamza Erol
Arda Senocak
Jiu Feng
Joon Son Chung
Mamba
139
25
0
05 Jun 2024
Contrasting Multiple Representations with the Multi-Marginal Matching Gap
Zoe Piran
Michal Klein
James Thornton
Marco Cuturi
100
3
0
29 May 2024
CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale
ZeMing Gong
Austin T. Wang
Joakim Bruslund Haurum
Scott C. Lowe
Graham W. Taylor
Angel X. Chang
130
7
0
27 May 2024
LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
A. Fuller
Daniel G. Kyrollos
Yousef Yassin
James R. Green
109
3
0
22 May 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
106
10
0
22 May 2024
Impact of Stickers on Multimodal Chat Sentiment Analysis and Intent Recognition: A New Task, Dataset and Baseline
Yuanchen Shi
Biao Ma
Fang Kong
57
0
0
14 May 2024
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang
Jianqin Yin
90
1
0
13 May 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
77
2
0
12 May 2024
Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment
Tengjun Huang
117
0
0
28 Apr 2024
A review of deep learning-based information fusion techniques for multimodal medical image classification
Yi-Hsuan Li
Mostafa EL HABIB DAHO
Pierre-Henri Conze
Rachid Zeghlache
Hugo Le Boité
R. Tadayoni
B. Cochener
M. Lamard
G. Quellec
65
46
0
23 Apr 2024
General Item Representation Learning for Cold-start Content Recommendations
Jooeun Kim
Jinri Kim
Kwangeun Yeo
Eungi Kim
Kyoung-Woon On
Jonghwan Mun
Joonseok Lee
VLM
57
1
0
22 Apr 2024
A Survey on Multimodal Wearable Sensor-based Human Action Recognition
Jianyuan Ni
Hao Tang
Syed Tousiful Haque
Yan Yan
A. Ngu
122
9
0
14 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen
Kumar Ashutosh
Rohit Girdhar
David Harwath
Kristen Grauman
EgoV, SSL
86
7
0
08 Apr 2024
Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition
Yash Jain
David M. Chan
Pranav Dheram
Aparna Khare
Olabanji Shonibare
Venkatesh Ravichandran
Shalini Ghosh
75
2
0
28 Mar 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
85
7
0
28 Mar 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
148
7
0
28 Mar 2024
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang
Guohao Sun
Pichao Wang
Dongfang Liu
S. Dianat
Majid Rabbani
Raghuveer M. Rao
Zhiqiang Tao
VGen
114
26
0
26 Mar 2024
Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval
Shunsuke Tsubaki
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Keisuke Imoto
65
1
0
16 Mar 2024
Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning
Tingtian Li
Zixun Sun
Xinyu Xiao
70
3
0
14 Mar 2024
Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction
Jianping Jiang
Xinyu Zhou
Bingxuan Wang
Xiaoming Deng
Chao Xu
Boxin Shi
96
7
0
12 Mar 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
123
36
0
20 Feb 2024
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Q. Garrido
Jean Ponce
Xinlei Chen
Michael G. Rabbat
Yann LeCun
Mahmoud Assran
Nicolas Ballas
MDE, VLM
155
87
0
15 Feb 2024
Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection
Yang Liu
Tongfei Shen
Dong Zhang
Qingying Sun
Shoushan Li
Guodong Zhou
51
5
0
14 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
95
6
0
14 Feb 2024
Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data
Shufan Li
Harkanwar Singh
Aditya Grover
Mamba
173
64
0
08 Feb 2024
Multimodal Neurodegenerative Disease Subtyping Explained by ChatGPT
Diego Machado Reyes
Hanqing Chao
Juergen Hahn
Li Shen
Pingkun Yan
12
2
0
31 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL, LRM
164
216
0
24 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
76
0
0
22 Jan 2024
Cascaded Cross-Modal Transformer for Audio-Textual Classification
Nicolae-Cătălin Ristea
Andrei Anghel
Radu Tudor Ionescu
96
2
0
15 Jan 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
87
5
0
08 Jan 2024
Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
60
4
0
08 Jan 2024
Joint Self-Supervised and Supervised Contrastive Learning for Multimodal MRI Data: Towards Predicting Abnormal Neurodevelopment
Zhiyuan Li
Hailong Li
Anca L. Ralescu
Jonathan R. Dillman
M. Altaye
Kim M. Cecil
N. Parikh
Lili He
63
2
0
22 Dec 2023
Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Tianlin Li
Jiandong Jin
Chenglong Li
Jin Tang
Cheng Zhang
Wei Wang
VLM
70
16
0
17 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM, MLLM
67
44
0
11 Dec 2023
A Comprehensive Study of Vision Transformers in Image Classification Tasks
Mahmoud Khalil
Ahmad Khalil
A. Ngom
ViT
62
10
0
02 Dec 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
İlker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
94
19
0
13 Nov 2023
Towards A Unified Neural Architecture for Visual Recognition and Reasoning
Calvin Luo
Boqing Gong
Ting Chen
Chen Sun
OCL, ObjD
52
1
0
10 Nov 2023
Hierarchically Gated Recurrent Neural Network for Sequence Modeling
Zhen Qin
Aaron Courville
Yiran Zhong
90
80
0
08 Nov 2023
OmniVec: Learning robust representations with cross modal sharing
Siddharth Srivastava
Gaurav Sharma
SSL
95
67
0
07 Nov 2023
Object-centric Video Representation for Long-term Action Anticipation
Ce Zhang
Changcheng Fu
Shijie Wang
Nakul Agarwal
Kwonjoon Lee
Chiho Choi
Chen Sun
115
17
0
31 Oct 2023
ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages
Mohammad Akbari
Saeed Ranjbar Alvar
Behnam Kamranian
Amin Banitalebi-Dehkordi
Yong Zhang
AI4CE
31
0
0
26 Oct 2023
LLM-FP4: 4-Bit Floating-Point Quantized Transformers
Shih-yang Liu
Zechun Liu
Xijie Huang
Pingcheng Dong
Kwang-Ting Cheng
MQ
94
64
0
25 Oct 2023
Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity and Relation Extraction
Xuming Hu
Junzhe Chen
Aiwei Liu
Shiao Meng
Lijie Wen
Philip S. Yu
79
18
0
25 Oct 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
74
10
0
25 Oct 2023