ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2304.01008
  4. Cited By
Self-Supervised Multimodal Learning: A Survey
v1v2 (latest)

Self-Supervised Multimodal Learning: A Survey

31 March 2023
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
    SSL
ArXiv (abs)PDFHTMLGithub (254★)

Papers citing "Self-Supervised Multimodal Learning: A Survey"

27 / 127 papers shown
Title
VideoBERT: A Joint Model for Video and Language Representation Learning
VideoBERT: A Joint Model for Video and Language Representation Learning
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLMSSL
82
1,250
0
03 Apr 2019
Unpaired Image Captioning via Scene Graph Alignments
Unpaired Image Captioning via Scene Graph Alignments
Jiuxiang Gu
Shafiq Joty
Jianfei Cai
Handong Zhao
Xu Yang
G. Wang
GNN
73
174
0
26 Mar 2019
The Missing Ingredient in Zero-Shot Neural Machine Translation
The Missing Ingredient in Zero-Shot Neural Machine Translation
N. Arivazhagan
Ankur Bapna
Orhan Firat
Roee Aharoni
Melvin Johnson
Wolfgang Macherey
83
117
0
17 Mar 2019
Self-supervised Visual Feature Learning with Deep Neural Networks: A
  Survey
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Longlong Jing
Yingli Tian
SSL
171
1,701
0
16 Feb 2019
From Recognition to Cognition: Visual Commonsense Reasoning
From Recognition to Cognition: Visual Commonsense Reasoning
Rowan Zellers
Yonatan Bisk
Ali Farhadi
Yejin Choi
LRMBDLOCLReLM
183
883
0
27 Nov 2018
Learning Latent Dynamics for Planning from Pixels
Learning Latent Dynamics for Planning from Pixels
Danijar Hafner
Timothy Lillicrap
Ian S. Fischer
Ruben Villegas
David R Ha
Honglak Lee
James Davidson
BDL
92
1,448
0
12 Nov 2018
Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal
  Representations for Contact-Rich Tasks
Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks
Michelle A. Lee
Yuke Zhu
K. Srinivasan
Parth Shah
Silvio Savarese
Li Fei-Fei
Animesh Garg
Jeannette Bohg
SSL
91
370
0
24 Oct 2018
Cooperative Learning of Audio and Video Models from Self-Supervised
  Synchronization
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
Bruno Korbar
Du Tran
Lorenzo Torresani
99
476
0
30 Jun 2018
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
Andrew Owens
Alexei A. Efros
SSL
100
754
0
10 Apr 2018
The Sound of Pixels
The Sound of Pixels
Hang Zhao
Chuang Gan
Andrew Rouditchenko
Carl Vondrick
Josh H. McDermott
Antonio Torralba
VLM
102
537
0
09 Apr 2018
Scene Graph Parsing as Dependency Parsing
Scene Graph Parsing as Dependency Parsing
Yu-Siang Wang
Chenxi Liu
Fangyin Wei
Alan Yuille
GNN3DV
45
53
0
25 Mar 2018
VizWiz Grand Challenge: Answering Visual Questions from Blind People
VizWiz Grand Challenge: Answering Visual Questions from Blind People
Danna Gurari
Qing Li
Abigale Stangl
Anhong Guo
Chi Lin
Kristen Grauman
Jiebo Luo
Jeffrey P. Bigham
CoGe
114
862
0
22 Feb 2018
Objects that Sound
Objects that Sound
Relja Arandjelović
Andrew Zisserman
ObjDVOS
113
530
0
18 Dec 2017
Don't Just Assume; Look and Answer: Overcoming Priors for Visual
  Question Answering
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
Aishwarya Agrawal
Dhruv Batra
Devi Parikh
Aniruddha Kembhavi
OOD
155
586
0
01 Dec 2017
Unsupervised Machine Translation Using Monolingual Corpora Only
Unsupervised Machine Translation Using Monolingual Corpora Only
Guillaume Lample
Alexis Conneau
Ludovic Denoyer
MarcÁurelio Ranzato
SSL
146
1,097
0
31 Oct 2017
Localizing Moments in Video with Natural Language
Localizing Moments in Video with Natural Language
Lisa Anne Hendricks
Oliver Wang
Eli Shechtman
Josef Sivic
Trevor Darrell
Bryan C. Russell
125
949
0
04 Aug 2017
Multimodal Machine Learning: A Survey and Taxonomy
Multimodal Machine Learning: A Survey and Taxonomy
T. Baltrušaitis
Chaitanya Ahuja
Louis-Philippe Morency
116
2,939
0
26 May 2017
Look, Listen and Learn
Look, Listen and Learn
Relja Arandjelović
Andrew Zisserman
SSL
127
906
0
23 May 2017
Dense-Captioning Events in Videos
Dense-Captioning Events in Videos
Ranjay Krishna
Kenji Hata
F. Ren
Li Fei-Fei
Juan Carlos Niebles
144
1,250
0
02 May 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
Y. Jang
Yale Song
Youngjae Yu
Youngjin Kim
Gunhee Kim
87
561
0
14 Apr 2017
Towards Automatic Learning of Procedures from Web Instructional Videos
Towards Automatic Learning of Procedures from Web Instructional Videos
Luowei Zhou
Chenliang Xu
Jason J. Corso
EgoV
75
831
0
28 Mar 2017
Making the V in VQA Matter: Elevating the Role of Image Understanding in
  Visual Question Answering
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
352
3,273
0
02 Dec 2016
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
Aaron van den Oord
Nal Kalchbrenner
Oriol Vinyals
L. Espeholt
Alex Graves
Koray Kavukcuoglu
VLM
214
2,519
0
16 Jun 2016
Unsupervised Learning of Visual Representations by Solving Jigsaw
  Puzzles
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
M. Noroozi
Paolo Favaro
SSL
177
2,986
0
30 Mar 2016
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for
  Richer Image-to-Sentence Models
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Julia Hockenmaier
Svetlana Lazebnik
208
2,074
0
19 May 2015
VQA: Visual Question Answering
VQA: Visual Question Answering
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
226
5,509
0
03 May 2015
Understanding image representations by measuring their equivariance and
  equivalence
Understanding image representations by measuring their equivariance and equivalence
Karel Lenc
Andrea Vedaldi
SSLFAtt
116
537
0
21 Nov 2014
Previous
123