VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
    ViT
arXiv: 2104.11178

Papers citing "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"

50 / 360 papers shown
OAMixer: Object-aware Mixing Layer for Vision Transformers
H. Kang
Sangwoo Mo
Jinwoo Shin
VLM
116
4
0
13 Dec 2022
Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu
Eduardo Fonseca
Radu Tudor Ionescu
Mario Lucic
Cordelia Schmid
Anurag Arnab
SSL
118
45
0
09 Dec 2022
Deep Architectures for Content Moderation and Movie Content Rating
Fatih Çagatay Akyön
A. Temizel
80
5
0
08 Dec 2022
Multimodal Vision Transformers with Forced Attention for Behavior Analysis
Tanay Agrawal
Michal Balazia
Philippe Muller
François Brémond
ViT
82
9
0
07 Dec 2022
See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation
Hao Li
Yizhi Zhang
Junzhe Zhu
Shaoxiong Wang
Michelle A. Lee
Huazhe Xu
Edward H. Adelson
Li Fei-Fei
Ruohan Gao
Jiajun Wu
67
64
0
07 Dec 2022
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
84
58
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM, VGen
169
332
0
06 Dec 2022
VLG: General Video Recognition with Web Textual Knowledge
Jintao Lin
Zhaoyang Liu
Wenhai Wang
Wayne Wu
Limin Wang
94
1
0
03 Dec 2022
Multimodal Query-guided Object Localization
Aditay Tripathi
Rajath R Dani
Anand Mishra
Anirban Chakraborty
60
0
0
01 Dec 2022
Spatio-Temporal Crop Aggregation for Video Representation Learning
Sepehr Sameni
Simon Jenni
Paolo Favaro
94
3
0
30 Nov 2022
Survey on Self-Supervised Multimodal Representation Learning and Foundation Models
Sushil Thapa
AI4TS, SSL
48
1
0
29 Nov 2022
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning
Pritam Sarkar
Ali Etemad
112
23
0
25 Nov 2022
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Yatai Ji
Rong-Cheng Tu
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
78
15
0
24 Nov 2022
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Qiu-shi Zhu
Long Zhou
Zi-Hua Zhang
Shujie Liu
Binxing Jiao
Jie Zhang
Lirong Dai
Daxin Jiang
Jinyu Li
Furu Wei
111
38
0
21 Nov 2022
Token Turing Machines
Michael S. Ryoo
K. Gopalakrishnan
Kumara Kahatapitiya
Ted Xiao
Kanishka Rao
Austin Stone
Yao Lu
Julian Ibarz
Anurag Arnab
61
21
0
16 Nov 2022
Self-supervised remote sensing feature learning: Learning Paradigms, Challenges, and Future Works
Chao Tao
Ji Qi
Mingning Guo
Qing Zhu
Haifeng Li
SSL
104
59
0
15 Nov 2022
Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
Dennis Fedorishin
D. Mohan
Bhavin Jawade
S. Setlur
V. Govindaraju
VGen
113
11
0
06 Nov 2022
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Junru Wu
Yi Liang
Feng Han
Hassan Akbari
Zhangyang Wang
Cong Yu
70
10
0
03 Nov 2022
Changes from Classical Statistics to Modern Statistics and Data Science
Kai Zhang
Shan-Yu Liu
M. Xiong
87
0
0
30 Oct 2022
Multimodal Transformer for Parallel Concatenated Variational Autoencoders
Stephen D. Liang
J. Mendel
ViT
66
5
0
28 Oct 2022
Grafting Vision Transformers
Jong Sung Park
Kumara Kahatapitiya
Donghyun Kim
Shivchander Sudalairaj
Quanfu Fan
Michael S. Ryoo
ViT
97
3
0
28 Oct 2022
M$^3$Care: Learning with Missing Modalities in Multimodal Healthcare Data
Chaohe Zhang
Xu Chu
Liantao Ma
Yinghao Zhu
Yasha Wang
Jiangtao Wang
Junfeng Zhao
80
87
0
28 Oct 2022
Learning Joint Representation of Human Motion and Language
Jihoon Kim
Youngjae Yu
Seungyoung Shin
Taehyun Byun
Sungjoon Choi
77
5
0
27 Oct 2022
Masked Vision-Language Transformer in Fashion
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Daniel Gehrig
Luc Van Gool
83
25
0
27 Oct 2022
Scratching Visual Transformer's Back with Uniform Attention
Nam Hyeon-Woo
Kim Yu-Ji
Byeongho Heo
Dongyoon Han
Seong Joon Oh
Tae-Hyun Oh
530
23
0
16 Oct 2022
Linear Video Transformer with Feature Fixation
Kaiyue Lu
Zexia Liu
Jianyuan Wang
Weixuan Sun
Zhen Qin
...
Xuyang Shen
Huizhong Deng
Xiaodong Han
Yuchao Dai
Yiran Zhong
110
5
0
15 Oct 2022
S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces
Eric N. D. Nguyen
Karan Goel
Albert Gu
Gordon W. Downs
Preey Shah
Tri Dao
S. Baccus
Christopher Ré
VLM
93
40
0
12 Oct 2022
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
A. Fragomeni
Michael Wray
Dima Damen
CLIP, ViT
56
4
0
09 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
105
19
0
05 Oct 2022
Contrastive Audio-Visual Masked Autoencoder
Yuan Gong
Andrew Rouditchenko
Alexander H. Liu
David Harwath
Leonid Karlinsky
Hilde Kuehne
James R. Glass
122
128
0
02 Oct 2022
Data Poisoning Attacks Against Multimodal Encoders
Ziqing Yang
Xinlei He
Zheng Li
Michael Backes
Mathias Humbert
Pascal Berrang
Yang Zhang
AAML
176
52
0
30 Sep 2022
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Joey Tianyi Zhou
VLM
137
31
0
28 Sep 2022
Multi-dataset Training of Transformers for Robust Action Recognition
Junwei Liang
Enwei Zhang
Jun Zhang
Chunhua Shen
ViT
123
11
0
26 Sep 2022
Unsupervised Hashing with Semantic Concept Mining
Rong-Cheng Tu
Xian-Ling Mao
Kevin Qinghong Lin
Chengfei Cai
Weize Qin
Hongfa Wang
Wei Wei
Heyan Huang
122
12
0
23 Sep 2022
PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training
Rogerio Bonatti
Sai H. Vemprala
Shuang Ma
Felipe Vieira Frujeri
Shuhang Chen
Ashish Kapoor
94
23
0
22 Sep 2022
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM, VLM
136
153
0
15 Sep 2022
Distribution Aware Metrics for Conditional Natural Language Generation
David M. Chan
Yiming Ni
David A. Ross
Sudheendra Vijayanarasimhan
Austin Myers
John F. Canny
77
4
0
15 Sep 2022
Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?
Yi Wang
Zhiwen Fan
Tianlong Chen
Hehe Fan
Zhangyang Wang
ViT
107
10
0
15 Sep 2022
Vision Transformers for Action Recognition: A Survey
Anwaar Ulhaq
Naveed Akhtar
Ganna Pogrebna
Ajmal Mian
ViT
82
45
0
13 Sep 2022
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
114
88
0
07 Sep 2022
Fusion of Satellite Images and Weather Data with Transformer Networks for Downy Mildew Disease Detection
William Maillet
Maryam Ouhami
A. Hafiane
ViT, MedIm
39
7
0
06 Sep 2022
RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan
Jianwen Jiang
Samuel Albanie
Tao Feng
Ziyuan Huang
Dong Ni
Mingqian Tang
VLM
110
55
0
05 Sep 2022
Topic Detection in Continuous Sign Language Videos
Álvaro Budria
Laia Tarrés
Gerard I. Gállego
Francesc Moreno-Noguer
Jordi Torres
Xavier Giró-i-Nieto
SLR, VLM
91
1
0
01 Sep 2022
Learning in Audio-visual Context: A Review, Analysis, and New Perspective
Yake Wei
Di Hu
Yapeng Tian
Xuelong Li
135
55
0
20 Aug 2022
GPPF: A General Perception Pre-training Framework via Sparsely Activated Multi-Task Learning
Benyuan Sun
Jinqiao Dai
Zihao Liang
Cong Liu
Yi Yang
Bo Bai
MoE
75
4
0
03 Aug 2022
COCOA: Cross Modality Contrastive Learning for Sensor Data
Shohreh Deldari
Hao Xue
Aaqib Saeed
Daniel V. Smith
Flora D. Salim
SSL
94
40
0
31 Jul 2022
UAVM: Towards Unifying Audio and Visual Models
Yuan Gong
Alexander H. Liu
Andrew Rouditchenko
James R. Glass
75
23
0
29 Jul 2022
AutoTransition: Learning to Recommend Video Transition Effects
Yaojie Shen
Libo Zhang
Kai Xu
Xiaojie Jin
VGen
55
13
0
27 Jul 2022
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP, VLM
85
47
0
26 Jul 2022
Remote Medication Status Prediction for Individuals with Parkinson's Disease using Time-series Data from Smartphones
Weijian Li
Wei-wei Zhu
R. Dorsey
Jiebo Luo
59
0
0
26 Jul 2022