Self-Supervised Multimodal Learning: A Survey

31 March 2023

Yongshuo Zong

Oisin Mac Aodha

Timothy M. Hospedales

SSL

Papers citing "Self-Supervised Multimodal Learning: A Survey"

27 / 127 papers shown

Title
VideoBERT: A Joint Model for Video and Language Representation Learning Chen Sun Austin Myers Carl Vondrick Kevin Patrick Murphy Cordelia Schmid VLM SSL 82 1,250 0 03 Apr 2019
Unpaired Image Captioning via Scene Graph Alignments Jiuxiang Gu Shafiq Joty Jianfei Cai Handong Zhao Xu Yang G. Wang GNN 73 174 0 26 Mar 2019
The Missing Ingredient in Zero-Shot Neural Machine Translation N. Arivazhagan Ankur Bapna Orhan Firat Roee Aharoni Melvin Johnson Wolfgang Macherey 83 117 0 17 Mar 2019
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey Longlong Jing Yingli Tian SSL 171 1,701 0 16 Feb 2019
From Recognition to Cognition: Visual Commonsense Reasoning Rowan Zellers Yonatan Bisk Ali Farhadi Yejin Choi LRM BDL OCL ReLM 183 883 0 27 Nov 2018
Learning Latent Dynamics for Planning from Pixels Danijar Hafner Timothy Lillicrap Ian S. Fischer Ruben Villegas David R Ha Honglak Lee James Davidson BDL 92 1,448 0 12 Nov 2018
Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks Michelle A. Lee Yuke Zhu K. Srinivasan Parth Shah Silvio Savarese Li Fei-Fei Animesh Garg Jeannette Bohg SSL 91 370 0 24 Oct 2018
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization Bruno Korbar Du Tran Lorenzo Torresani 99 476 0 30 Jun 2018
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features Andrew Owens Alexei A. Efros SSL 100 754 0 10 Apr 2018
The Sound of Pixels Hang Zhao Chuang Gan Andrew Rouditchenko Carl Vondrick Josh H. McDermott Antonio Torralba VLM 102 537 0 09 Apr 2018
Scene Graph Parsing as Dependency Parsing Yu-Siang Wang Chenxi Liu Fangyin Wei Alan Yuille GNN 3DV 45 53 0 25 Mar 2018
VizWiz Grand Challenge: Answering Visual Questions from Blind People Danna Gurari Qing Li Abigale Stangl Anhong Guo Chi Lin Kristen Grauman Jiebo Luo Jeffrey P. Bigham CoGe 114 862 0 22 Feb 2018
Objects that Sound Relja Arandjelović Andrew Zisserman ObjD VOS 113 530 0 18 Dec 2017
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering Aishwarya Agrawal Dhruv Batra Devi Parikh Aniruddha Kembhavi OOD 155 586 0 01 Dec 2017
Unsupervised Machine Translation Using Monolingual Corpora Only Guillaume Lample Alexis Conneau Ludovic Denoyer Marc'Aurelio Ranzato SSL 146 1,097 0 31 Oct 2017
Localizing Moments in Video with Natural Language Lisa Anne Hendricks Oliver Wang Eli Shechtman Josef Sivic Trevor Darrell Bryan C. Russell 125 949 0 04 Aug 2017
Multimodal Machine Learning: A Survey and Taxonomy T. Baltrušaitis Chaitanya Ahuja Louis-Philippe Morency 116 2,939 0 26 May 2017
Look, Listen and Learn Relja Arandjelović Andrew Zisserman SSL 127 906 0 23 May 2017
Dense-Captioning Events in Videos Ranjay Krishna Kenji Hata F. Ren Li Fei-Fei Juan Carlos Niebles 144 1,250 0 02 May 2017
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering Y. Jang Yale Song Youngjae Yu Youngjin Kim Gunhee Kim 87 561 0 14 Apr 2017
Towards Automatic Learning of Procedures from Web Instructional Videos Luowei Zhou Chenliang Xu Jason J. Corso EgoV 75 831 0 28 Mar 2017
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering Yash Goyal Tejas Khot D. Summers-Stay Dhruv Batra Devi Parikh CoGe 352 3,273 0 02 Dec 2016
Conditional Image Generation with PixelCNN Decoders Aaron van den Oord Nal Kalchbrenner Oriol Vinyals L. Espeholt Alex Graves Koray Kavukcuoglu VLM 214 2,519 0 16 Jun 2016
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles M. Noroozi Paolo Favaro SSL 177 2,986 0 30 Mar 2016
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models Bryan A. Plummer Liwei Wang Christopher M. Cervantes Juan C. Caicedo Julia Hockenmaier Svetlana Lazebnik 208 2,074 0 19 May 2015
VQA: Visual Question Answering Aishwarya Agrawal Jiasen Lu Stanislaw Antol Margaret Mitchell C. L. Zitnick Dhruv Batra Devi Parikh CoGe 226 5,509 0 03 May 2015
Understanding image representations by measuring their equivariance and equivalence Karel Lenc Andrea Vedaldi SSL FAtt 116 537 0 21 Nov 2014