SoundNet: Learning Sound Representations from Unlabeled Video

27 October 2016

Y. Aytar

Carl Vondrick

Antonio Torralba

SSL


Papers citing "SoundNet: Learning Sound Representations from Unlabeled Video"

50 / 180 papers shown

Urban Rhapsody: Large-scale exploration of urban soundscapes Joao Rulff Fabio Miranda Maryam Hosseini Marcos Lage M. Cartwright Graham Dove J. P. Bello Claudio T. Silva 19 7 0 25 May 2022
Weakly-Supervised Action Detection Guided by Audio Narration Keren Ye Adriana Kovashka 38 0 0 12 May 2022
Probabilistic Representations for Video Contrastive Learning Jungin Park Jiyoung Lee Ig-Jae Kim Kwanghoon Sohn SSL 29 43 0 08 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound Yan-Bo Lin Jie Lei Joey Tianyi Zhou Gedas Bertasius 49 39 0 06 Apr 2022
Learning Neural Acoustic Fields Andrew F. Luo Yilun Du Michael J. Tarr J. Tenenbaum Antonio Torralba Chuang Gan AI4CE 20 79 0 04 Apr 2022
1-D CNN based Acoustic Scene Classification via Reducing Layer-wise Dimensionality Arshdeep Singh 25 1 0 31 Mar 2022
Multitask Emotion Recognition Model with Knowledge Distillation and Task Discriminator Euiseok Jeong Geesung Oh Sejoon Lim CVBM 22 7 0 24 Mar 2022
Automated detection of foreground speech with wearable sensing in everyday home environments: A transfer learning approach Dawei Liang Zifan Xu Yinuo Chen Rebecca Adaimi David Harwath Edison Thomaz 48 1 0 21 Mar 2022
Learning Audio Representations with MLPs Mashrur M. Morshed Ahmad Omar Ahsan H. Mahmud Md. Kamrul Hasan 27 4 0 16 Mar 2022
Visually Supervised Speaker Detection and Localization via Microphone Array Davide Berghi A. Hilton Philip J. B. Jackson 24 11 0 07 Mar 2022
Audio Self-supervised Learning: A Survey Shuo Liu Adria Mallol-Ragolta Emilia Parada-Cabeleiro Kun Qian Xingshuo Jing Alexander Kathan Bin Hu Bjoern W. Schuller SSL 40 106 0 02 Mar 2022
Visual Sound Localization in the Wild by Cross-Modal Interference Erasing Xian Liu Rui Qian Hang Zhou Di Hu Weiyao Lin Ziwei Liu Bolei Zhou Xiaowei Zhou 18 25 0 13 Feb 2022
Real-time Emergency Vehicle Event Detection Using Audio Data Zubayer Islam Mohamed Abdel-Aty 14 5 0 03 Feb 2022
Keyword localisation in untranscribed speech using visually grounded speech models Kayode Olaleye Dan Oneaţă Herman Kamper 32 7 0 02 Feb 2022
Sound and Visual Representation Learning with Multiple Pretraining Tasks A. Vasudevan Dengxin Dai Luc Van Gool SSL 33 6 0 04 Jan 2022
Multimodal Image Synthesis and Editing: The Generative AI Era Fangneng Zhan Yingchen Yu Rongliang Wu Jiahui Zhang Shijian Lu Lingjie Liu Adam Kortylewski Christian Theobalt Eric Xing EGVM 31 48 0 27 Dec 2021
Class-aware Sounding Objects Localization via Audiovisual Correspondence Di Hu Yake Wei Rui Qian Weiyao Lin Ruihua Song Ji-Rong Wen 24 41 0 22 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval Nina Shvetsova Brian Chen Andrew Rouditchenko Samuel Thomas Brian Kingsbury Rogerio Feris David Harwath James R. Glass Hilde Kuehne ViT 34 128 0 08 Dec 2021
Health Monitoring of Industrial machines using Scene-Aware Threshold Selection Arshdeep Singh R. Arvind Padmanabhan Rajan 19 1 0 21 Nov 2021
Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video Rishabh Garg Ruohan Gao Kristen Grauman 15 28 0 21 Nov 2021
Wav2CLIP: Learning Robust Audio Representations From CLIP Ho-Hsiang Wu Prem Seetharaman Kundan Kumar J. P. Bello CLIP VLM 42 268 0 21 Oct 2021
The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning Haider Al-Tahan Y. Mohsenzadeh SSL AI4TS 34 0 0 13 Oct 2021
Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or .... Prateek Verma AI4TS 32 2 0 07 Oct 2021
Understanding and Improving Usability of Data Dashboards for Simplified Privacy Control of Voice Assistant Data (Extended Version) Vandit Sharma Mainack Mondal 17 3 0 06 Oct 2021
Audio-Visual Collaborative Representation Learning for Dynamic Saliency Prediction Hailong Ning Bin Zhao Zhanxuan Hu Lang He Ercheng Pei 32 10 0 17 Sep 2021
Parsing Birdsong with Deep Audio Embeddings Irina Tolkova Brian Chu Marcel Hedman Stefan Kahl Holger Klinck 36 10 0 20 Aug 2021
LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector Xiaoyang Guo Shaoshuai Shi Xiaogang Wang Hongsheng Li 3DPC 34 106 0 18 Aug 2021
Cross-modal Spectrum Transformation Network For Acoustic Scene classification Yang Liu A. Neophytou Sunando Sengupta Eric Sommerlade 21 9 0 13 Aug 2021
Hybrid Reasoning Network for Video-based Commonsense Captioning Weijiang Yu Jian Liang Lei Ji Lu Li Yuejian Fang Nong Xiao Nan Duan 19 10 0 05 Aug 2021
DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs J. Nistal Stefan Lattner G. Richard 21 8 0 03 Aug 2021
PERSA+: A Deep Learning Front-End for Context-Agnostic Audio Classification Lazaros Vrysis Iordanis Thoidis Charalampos A. Dimoulas G. Papanikolaou VLM 33 0 0 20 Jul 2021
FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos Sanchita Ghose John J. Prevost GAN 27 26 0 20 Jul 2021
Attention Bottlenecks for Multimodal Fusion Arsha Nagrani Shan Yang Anurag Arnab A. Jansen Cordelia Schmid Chen Sun 25 543 0 30 Jun 2021
Multi-level Attention Fusion Network for Audio-visual Event Recognition Mathilde Brousmiche Jean Rouat Stéphane Dupont 27 11 0 12 Jun 2021
VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living Srijan Das Rui Dai Di Yang F. Brémond ViT 43 66 0 17 May 2021
Temporal-Spatial Feature Pyramid for Video Saliency Detection Qinyao Chang Shiping Zhu 41 27 0 10 May 2021
Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos Yanghao Li Tushar Nagarajan Bo Xiong Kristen Grauman EgoV 51 84 0 16 Apr 2021
Can audio-visual integration strengthen robustness under multimodal attacks? Yapeng Tian Chenliang Xu AAML 36 37 0 05 Apr 2021
Unsupervised Sound Localization via Iterative Contrastive Learning Yan-Bo Lin Hung-Yu Tseng Hsin-Ying Lee Yen-Yu Lin Ming-Hsuan Yang SSL 27 34 0 01 Apr 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning Mandela Patrick Yuki M. Asano Bernie Huang Ishan Misra Florian Metze Joao Henriques Andrea Vedaldi AI4TS 29 33 0 18 Mar 2021
Slow-Fast Auditory Streams For Audio Recognition Evangelos Kazakos Arsha Nagrani Andrew Zisserman Dima Damen 24 66 0 05 Mar 2021
Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss Naoki Makishima Mana Ihori Akihiko Takashima Tomohiro Tanaka Shota Orihashi Ryo Masumura 30 8 0 02 Mar 2021
There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge Francisco Rivera Valverde Juana Valeria Hurtado Abhinav Valada 26 72 0 01 Mar 2021
Learning Audio-Visual Correlations from Variational Cross-Modal Generation Ye Zhu Yu Wu Hugo Latapie Yi Yang Yan Yan SSL 44 20 0 05 Feb 2021
Environment Transfer for Distributed Systems Chunheng Jiang Jae-wook Ahn N. Desai 28 1 0 06 Jan 2021
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction Samyak Jain P. Yarlagadda Shreyank Jyoti Shyamgopal Karthik Subramanian Ramanathan Vineet Gandhi ViT 29 66 0 11 Dec 2020
Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio João P. Ferreira Thiago M. Coutinho Thiago L. Gomes J. F. Neto Rafael Azevedo Renato Martins Erickson R. Nascimento GAN 36 68 0 25 Nov 2020
Learning Representations from Audio-Visual Spatial Alignment Pedro Morgado Yi Li Nuno Vasconcelos SSL 27 121 0 03 Nov 2020
Listening to Sounds of Silence for Speech Denoising Ruilin Xu Rundi Wu Y. Ishiwaka Carl Vondrick Changxi Zheng 28 32 0 22 Oct 2020
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning Ying Cheng Ruize Wang Zhihao Pan Rui Feng Yuejie Zhang SSL 36 106 0 13 Aug 2020