Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

20 December 2017

Andrew Owens

Jiajun Wu

Josh H. McDermott

William T. Freeman

Antonio Torralba

SSL

ArXiv PDF HTML

Papers citing "Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning"

50 / 52 papers shown

Title
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment Arda Senocak H. Ryu Junsik Kim Tae-Hyun Oh Hanspeter Pfister Joon Son Chung 38 3 0 18 Jul 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models David Kurzendörfer Otniel-Bogdan Mercea A. Sophia Koepke Zeynep Akata VLM CLIP 33 2 0 09 Apr 2024
Mix and Localize: Localizing Sound Sources in Mixtures Xixi Hu Ziyang Chen Andrew Owens 23 51 0 28 Nov 2022
VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement Chenye Cui Yi Ren Jinglin Liu Rongjie Huang Zhou Zhao VGen 30 14 0 19 Nov 2022
Contrastive Audio-Visual Masked Autoencoder Yuan Gong Andrew Rouditchenko Alexander H. Liu David Harwath Leonid Karlinsky Hilde Kuehne James R. Glass 35 120 0 02 Oct 2022
TVLT: Textless Vision-Language Transformer Zineng Tang Jaemin Cho Yixin Nie Joey Tianyi Zhou VLM 51 28 0 28 Sep 2022
A Closer Look at Weakly-Supervised Audio-Visual Source Localization Shentong Mo Pedro Morgado 83 64 0 30 Aug 2022
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey Yanbei Chen Massimiliano Mancini Xiatian Zhu Zeynep Akata 45 113 0 24 Aug 2022
Learning in Audio-visual Context: A Review, Analysis, and New Perspective Yake Wei Di Hu Yapeng Tian Xuelong Li 46 55 0 20 Aug 2022
Temporal and cross-modal attention for audio-visual zero-shot learning Otniel-Bogdan Mercea Thomas Hummel A. Sophia Koepke Zeynep Akata 38 25 0 20 Jul 2022
Finding Fallen Objects Via Asynchronous Audio-Visual Integration Chuang Gan Yi Gu Siyuan Zhou Jeremy Schwartz S. Alter James Traer Dan Gutfreund J. Tenenbaum Josh H. McDermott Antonio Torralba 47 19 0 07 Jul 2022
Sound Localization by Self-Supervised Time Delay Estimation Ziyang Chen David Fouhey Andrew Owens SSL 24 19 0 26 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound Yan-Bo Lin Jie Lei Joey Tianyi Zhou Gedas Bertasius 41 39 0 06 Apr 2022
Audio Self-supervised Learning: A Survey Shuo Liu Adria Mallol-Ragolta Emilia Parada-Cabeleiro Kun Qian Xingshuo Jing Alexander Kathan Bin Hu Bjoern W. Schuller SSL 35 106 0 02 Mar 2022
Keyword localisation in untranscribed speech using visually grounded speech models Kayode Olaleye Dan Oneaţă Herman Kamper 26 7 0 02 Feb 2022
Video Transformers: A Survey Javier Selva A. S. Johansen Sergio Escalera Kamal Nasrollahi T. Moeslund Albert Clapés ViT 22 103 0 16 Jan 2022
Class-aware Sounding Objects Localization via Audiovisual Correspondence Di Hu Yake Wei Rui Qian Weiyao Lin Ruihua Song Ji-Rong Wen 24 41 0 22 Dec 2021
The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning Haider Al-Tahan Y. Mohsenzadeh SSL AI4TS 34 0 0 13 Oct 2021
Attention is All You Need? Good Embeddings with Statistics are enough:Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or .... Prateek Verma AI4TS 32 2 0 07 Oct 2021
FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos Sanchita Ghose John J. Prevost GAN 24 26 0 20 Jul 2021
LiRA: Learning Visual Speech Representations from Audio through Self-supervision Pingchuan Ma Rodrigo Mira Stavros Petridis Björn W. Schuller M. Pantic SSL 21 53 0 16 Jun 2021
Multi-level Attention Fusion Network for Audio-visual Event Recognition Mathilde Brousmiche Jean Rouat Stéphane Dupont 24 11 0 12 Jun 2021
Unsupervised Sound Localization via Iterative Contrastive Learning Yan-Bo Lin Hung-Yu Tseng Hsin-Ying Lee Yen-Yu Lin Ming-Hsuan Yang SSL 24 34 0 01 Apr 2021
Listening to Sounds of Silence for Speech Denoising Ruilin Xu Rundi Wu Y. Ishiwaka Carl Vondrick Changxi Zheng 25 32 0 22 Oct 2020
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning Ying Cheng Ruize Wang Zhihao Pan Rui Feng Yuejie Zhang SSL 27 106 0 13 Aug 2020
Self-Supervised Learning of Audio-Visual Objects from Video Triantafyllos Afouras Andrew Owens Joon Son Chung Andrew Zisserman SSL 19 252 0 10 Aug 2020
Learning Video Representations from Textual Web Supervision Jonathan C. Stroud Zhichao Lu Chen Sun Jia Deng Rahul Sukthankar Cordelia Schmid David A. Ross SSL 40 48 0 29 Jul 2020
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing Yapeng Tian Dingzeyu Li Chenliang Xu 6 180 0 21 Jul 2020
Multiple Sound Sources Localization from Coarse to Fine Rui Qian Di Hu Heinrich Dinkel Mengyue Wu N. Xu Weiyao Lin 28 155 0 13 Jul 2020
Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision Abhinav Shukla Stavros Petridis M. Pantic SSL 32 16 0 08 Jul 2020
Visually Guided Sound Source Separation using Cascaded Opponent Filter Network Lingyu Zhu Esa Rahtu 22 23 0 04 Jun 2020
S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation Yizhe Zhu Martin Renqiang Min Asim Kadav H. Graf CoGe DRL 21 95 0 23 May 2020
VisualEchoes: Spatial Image Representation Learning through Echolocation Ruohan Gao Changan Chen Ziad Al-Halah Carl Schissler Kristen Grauman MDE SSL 171 83 0 04 May 2020
Self-Supervised Learning by Cross-Modal Audio-Video Clustering Humam Alwassel D. Mahajan Bruno Korbar Lorenzo Torresani Guohao Li Du Tran SSL 30 428 0 28 Nov 2019
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications Arda Senocak Tae-Hyun Oh Junsik Kim Ming-Hsuan Yang In So Kweon SSL 30 52 0 20 Nov 2019
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition Evangelos Kazakos Arsha Nagrani Andrew Zisserman Dima Damen EgoV 16 332 0 22 Aug 2019
Learning Video Representations using Contrastive Bidirectional Transformer Chen Sun Fabien Baradel Kevin Patrick Murphy Cordelia Schmid SSL ViT 21 133 0 13 Jun 2019
Audio-Visual Model Distillation Using Acoustic Images Andrés F. Pérez Valentina Sanguineti Pietro Morerio Vittorio Murino VLM 15 27 0 16 Apr 2019
The Sound of Motions Hang Zhao Chuang Gan Wei-Chiu Ma Antonio Torralba 17 250 0 11 Apr 2019
A Simple Baseline for Audio-Visual Scene-Aware Dialog Idan Schwartz A. Schwing Tamir Hazan 21 69 0 11 Apr 2019
An Attempt towards Interpretable Audio-Visual Video Captioning Yapeng Tian Chenxiao Guan Justin Goodman Marc Moore Chenliang Xu 33 20 0 07 Dec 2018
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild Samuel Albanie Arsha Nagrani Andrea Vedaldi Andrew Zisserman CVBM 27 270 0 16 Aug 2018
Unsupervised learning of foreground object detection Ioana Croitoru Simion-Vlad Bogolin Marius Leordeanu OCL 26 48 0 14 Aug 2018
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features Andrew Owens Alexei A. Efros SSL 48 743 0 10 Apr 2018
Objects that Sound Relja Arandjelović Andrew Zisserman ObjD VOS 38 528 0 18 Dec 2017
Interpreting Deep Visual Representations via Network Dissection Bolei Zhou David Bau A. Oliva Antonio Torralba FAtt MILM 29 323 0 15 Nov 2017
Unsupervised Representation Learning by Sorting Sequences Hsin-Ying Lee Jia-Bin Huang Maneesh Kumar Singh Ming-Hsuan Yang SSL DRL 17 533 0 03 Aug 2017
Active Decision Boundary Annotation with Deep Generative Models Miriam W. Huijser J. C. V. Gemert 16 46 0 20 Mar 2017
Colorization as a Proxy Task for Visual Understanding Gustav Larsson Michael Maire Gregory Shakhnarovich SSL 47 493 0 11 Mar 2017
Learning Features by Watching Objects Move Deepak Pathak Ross B. Girshick Piotr Dollár Trevor Darrell Bharath Hariharan SSL VOS OCL 24 522 0 19 Dec 2016