Title
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment Edson Araujo Andrew Rouditchenko Yuan Gong Saurabhchand Bhati Samuel Thomas Brian Kingsbury Leonid Karlinsky Rogerio Feris James Glass Hilde Kuehne 58 0 0 02 May 2025
Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos Soumya Jahagirdar Jayasree Saha C. V. Jawahar 70 0 0 11 Mar 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention Joe Dhanith Shravan Venkatraman Modigari Narendra Vigya Sharma Santhosh Malarvannan 111 0 0 20 Feb 2025
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling Zeyue Tian Zhaoyang Liu Ruibin Yuan Jiahao Pan Xiaoqiang Huang Xu Tan Qifeng Chen Yu Guo VGen 158 16 0 06 Jun 2024
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions Mathew Monfort SouYoung Jin Alexander H. Liu David Harwath Rogerio Feris James Glass Aude Oliva 27 59 0 10 May 2021
Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval Ramon Sanabria Austin Waters Jason Baldridge 3DV 30 25 0 05 Apr 2021
QuerYD: A video dataset with high-quality text and audio narrations Andreea-Maria Oncescu João F. Henriques Yang Liu Andrew Zisserman Samuel Albanie VGen 29 11 0 22 Nov 2020
Self-Supervised MultiModal Versatile Networks Jean-Baptiste Alayrac Adrià Recasens R. Schneider Relja Arandjelović Jason Ramapuram J. Fauw Lucas Smaira Sander Dieleman Andrew Zisserman SSL 96 373 0 29 Jun 2020
Telling Left from Right: Learning Spatial Correspondence of Sight and Sound Karren D. Yang Bryan C. Russell Justin Salamon SSL 49 75 0 11 Jun 2020
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning Elad Amrani Rami Ben-Ari Daniel Rotman A. Bronstein 43 122 0 06 Mar 2020
A Simple Framework for Contrastive Learning of Visual Representations Ting Chen Simon Kornblith Mohammad Norouzi Geoffrey E. Hinton SSL 186 18,523 0 13 Feb 2020
End-to-End Learning of Visual Representations from Uncurated Instructional Videos Antoine Miech Jean-Baptiste Alayrac Lucas Smaira Ivan Laptev Josef Sivic Andrew Zisserman VGen SSL 93 705 0 13 Dec 2019
PyTorch: An Imperative Style, High-Performance Deep Learning Library Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury ... Sasank Chilamkurthy Benoit Steiner Lu Fang Junjie Bai Soumith Chintala ODL 211 42,038 0 03 Dec 2019
Self-Supervised Learning by Cross-Modal Audio-Video Clustering Humam Alwassel D. Mahajan Bruno Korbar Lorenzo Torresani Guohao Li Du Tran SSL 63 429 0 28 Nov 2019
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech David Harwath Wei-Ning Hsu James R. Glass 39 84 0 21 Nov 2019
Self-supervised Moving Vehicle Tracking with Stereo Sound Chuang Gan Hang Zhao Peihao Chen David D. Cox Antonio Torralba 27 147 0 25 Oct 2019
Large-scale representation learning from visually grounded untranscribed speech Gabriel Ilharco Yuan Zhang Jason Baldridge SSL 46 60 0 19 Sep 2019
Language learning using Speech to Image retrieval Danny Merkx S. Frank M. Ernestus 23 43 0 09 Sep 2019
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings Michael Wray Diane Larlus G. Csurka Dima Damen 71 152 0 09 Aug 2019
Use What You Have: Video Retrieval Using Representations From Collaborative Experts Yang Liu Samuel Albanie Arsha Nagrani Andrew Zisserman 55 387 0 31 Jul 2019
Learning Video Representations using Contrastive Bidirectional Transformer Chen Sun Fabien Baradel Kevin Patrick Murphy Cordelia Schmid SSL ViT 94 133 0 13 Jun 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips Antoine Miech Dimitri Zhukov Jean-Baptiste Alayrac Makarand Tapaswi Ivan Laptev Josef Sivic VGen 87 1,186 0 07 Jun 2019
Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data Hilde Kuehne Ahsan Iqbal Alexander Richard Juergen Gall 25 17 0 03 Jun 2019
Large-scale weakly-supervised pre-training for video action recognition Deepti Ghadiyaram Matt Feiszli Du Tran Xueting Yan Heng Wang D. Mahajan 40 299 0 02 May 2019
Self-Supervised Audio-Visual Co-Segmentation Andrew Rouditchenko Hang Zhao Chuang Gan Josh H. McDermott Antonio Torralba VLM SSL 30 103 0 18 Apr 2019
Co-Separating Sounds of Visual Objects Ruohan Gao Kristen Grauman 97 208 0 16 Apr 2019
Semantic query-by-example speech search using visual grounding Herman Kamper Aristotelis Anastassiou Karen Livescu 41 29 0 15 Apr 2019
The Sound of Motions Hang Zhao Chuang Gan Wei-Chiu Ma Antonio Torralba 51 252 0 11 Apr 2019
VideoBERT: A Joint Model for Video and Language Representation Learning Chen Sun Austin Myers Carl Vondrick Kevin Patrick Murphy Cordelia Schmid VLM SSL 37 1,238 0 03 Apr 2019
Cross-task weakly supervised learning from instructional videos Dimitri Zhukov Jean-Baptiste Alayrac R. G. Cinbis David Fouhey Ivan Laptev Josef Sivic SSL 103 245 0 19 Mar 2019
COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis Yansong Tang Dajun Ding Yongming Rao Yu Zheng Danyang Zhang Lili Zhao Jiwen Lu Jie Zhou 96 308 0 07 Mar 2019
2.5D Visual Sound Ruohan Gao Kristen Grauman VGen 97 130 0 11 Dec 2018
Learning from Multiview Correlations in Open-Domain Videos Nils Holzenberger Shruti Palaskar Pranava Madhyastha Florian Metze R. Arora SSL 38 11 0 21 Nov 2018
Multimodal One-Shot Learning of Speech and Images Ryan Eloff H. Engelbrecht Herman Kamper SSL VLM 27 35 0 09 Nov 2018
How2: A Large-scale Dataset for Multimodal Language Understanding Ramon Sanabria Ozan Caglayan Shruti Palaskar Desmond Elliott Loïc Barrault Lucia Specia Florian Metze VGen MLLM 57 287 0 01 Nov 2018
Self-Supervised Generation of Spatial Audio for 360 Video Pedro Morgado Nuno Vasconcelos Timothy R. Langlois Oliver Wang MDE 42 171 0 07 Sep 2018
A Joint Sequence Fusion Model for Video Question Answering and Retrieval Youngjae Yu Jongseok Kim Gunhee Kim 59 343 0 07 Aug 2018
Representation Learning with Contrastive Predictive Coding Aaron van den Oord Yazhe Li Oriol Vinyals DRL SSL 215 10,152 0 10 Jul 2018
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization Bruno Korbar Du Tran Lorenzo Torresani 67 473 0 30 Jun 2018
Visually grounded cross-lingual keyword spotting in speech Herman Kamper Michael Roth 34 34 0 13 Jun 2018
Disentangling by Partitioning: A Representation Learning Framework for Multimodal Sensory Data Wei-Ning Hsu James R. Glass DRL 56 43 0 29 May 2018
On Learning Associations of Faces and Voices Changil Kim Hijung Valentina Shin Tae-Hyun Oh Alexandre Kaspar Mohamed A. Elgharib Wojciech Matusik CVBM 29 83 0 15 May 2018
Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction Luowei Zhou Nathan Louis Jason J. Corso 57 94 0 08 May 2018
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features Andrew Owens Alexei A. Efros SSL 69 747 0 10 Apr 2018
The Sound of Pixels Hang Zhao Chuang Gan Andrew Rouditchenko Carl Vondrick Josh H. McDermott Antonio Torralba VLM 58 532 0 09 Apr 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data Antoine Miech Ivan Laptev Josef Sivic 41 233 0 07 Apr 2018
Learning to Separate Object Sounds by Watching Unlabeled Video Ruohan Gao Rogerio Feris Kristen Grauman SSL 42 284 0 05 Apr 2018
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input David Harwath Adrià Recasens Dídac Surís Galen Chuang Antonio Torralba James R. Glass 59 200 0 04 Apr 2018
Seeing Voices and Hearing Faces: Cross-modal biometric matching Arsha Nagrani Samuel Albanie Andrew Zisserman CVBM 46 220 0 01 Apr 2018
Unsupervised Learning and Segmentation of Complex Activities from Video Fadime Sener Angela Yao 39 112 0 26 Mar 2018