Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

10 April 2018

Papers citing "Audio-Visual Scene Analysis with Self-Supervised Multisensory Features"

50 / 181 papers shown

Title
$Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos$ Pano-AVQA: Grounded Audio-Visual Question Answering on 360 $^\circ$ Videos Heeseung Yun Youngjae Yu Wonsuk Yang Kangil Lee Gunhee Kim 25 78 0 11 Oct 2021
Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning Zixu Zhao Yueming Jin Pheng-Ann Heng SSL 37 21 0 28 Sep 2021
Visual Scene Graphs for Audio Source Separation Moitreya Chatterjee Jonathan Le Roux Narendra Ahuja A. Cherian 18 36 0 24 Sep 2021
Temporal Knowledge Consistency for Unsupervised Visual Representation Learning Wei Feng Yuanjiang Wang Lihua Ma Ye Yuan Chi Zhang SSL 21 13 0 24 Aug 2021
Exploring Data Aggregation and Transformations to Generalize across Visual Domains Antono DÍnnocente OOD 33 0 0 20 Aug 2021
Attention Bottlenecks for Multimodal Fusion Arsha Nagrani Shan Yang Anurag Arnab A. Jansen Cordelia Schmid Chen Sun 25 543 0 30 Jun 2021
LiRA: Learning Visual Speech Representations from Audio through Self-supervision Pingchuan Ma Rodrigo Mira Stavros Petridis Björn W. Schuller M. Pantic SSL 21 53 0 16 Jun 2021
Multi-level Attention Fusion Network for Audio-visual Event Recognition Mathilde Brousmiche Jean Rouat Stéphane Dupont 21 11 0 12 Jun 2021
VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living Srijan Das Rui Dai Di Yang F. Brémond ViT 43 66 0 17 May 2021
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation Hang Zhou Yasheng Sun Wayne Wu Chen Change Loy Xiaogang Wang Ziwei Liu CVBM 28 360 0 22 Apr 2021
A cappella: Audio-visual Singing Voice Separation Juan F. Montesinos V. S. Kadandale G. Haro 38 16 0 20 Apr 2021
Visually Informed Binaural Audio Generation without Binaural Audios Xudong Xu Hang Zhou Ziwei Liu Bo Dai Xiaogang Wang Dahua Lin DiffM 13 53 0 13 Apr 2021
Object Priors for Classifying and Localizing Unseen Actions Pascal Mettes William Thong Cees G. M. Snoek 24 20 0 10 Apr 2021
Can audio-visual integration strengthen robustness under multimodal attacks? Yapeng Tian Chenliang Xu AAML 25 37 0 05 Apr 2021
Unsupervised Sound Localization via Iterative Contrastive Learning Yan-Bo Lin Hung-Yu Tseng Hsin-Ying Lee Yen-Yu Lin Ming-Hsuan Yang SSL 21 34 0 01 Apr 2021
Broaden Your Views for Self-Supervised Video Learning Adrià Recasens Pauline Luc Jean-Baptiste Alayrac Luyu Wang Ross Hemsley ... Florent Altché M. Valko Jean-Bastien Grill Aaron van den Oord Andrew Zisserman SSL AI4TS 31 127 0 30 Mar 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning Mandela Patrick Yuki M. Asano Bernie Huang Ishan Misra Florian Metze Joao Henriques Andrea Vedaldi AI4TS 29 33 0 18 Mar 2021
Beyond Image to Depth: Improving Depth Prediction using Echoes Kranti K. Parida Siddharth Srivastava Gaurav Sharma MDE 45 37 0 15 Mar 2021
Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss Naoki Makishima Mana Ihori Akihiko Takashima Tomohiro Tanaka Shota Orihashi Ryo Masumura 22 8 0 02 Mar 2021
There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge Francisco Rivera Valverde Juana Valeria Hurtado Abhinav Valada 26 72 0 01 Mar 2021
Audiovisual Highlight Detection in Videos Karel Mundnich Alexandra Fenster Aparna Khare Shiva Sundaram 22 6 0 11 Feb 2021
Learning Audio-Visual Correlations from Variational Cross-Modal Generation Ye Zhu Yu Wu Hugo Latapie Yi Yang Yan Yan SSL 20 20 0 05 Feb 2021
MAAS: Multi-modal Assignation for Active Speaker Detection Juan Carlos León Alcázar Fabian Caba Heilbron Ali K. Thabet Guohao Li 62 51 0 11 Jan 2021
ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction Samyak Jain P. Yarlagadda Shreyank Jyoti Shyamgopal Karthik Subramanian Ramanathan Vineet Gandhi ViT 29 66 0 11 Dec 2020
Multi-modal Fusion for Single-Stage Continuous Gesture Recognition Harshala Gammulle Simon Denman Sridha Sridharan Clinton Fookes SLR 22 29 0 10 Nov 2020
Learning Representations from Audio-Visual Spatial Alignment Pedro Morgado Yi Li Nuno Vasconcelos SSL 27 121 0 03 Nov 2020
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds Efthymios Tzinis Scott Wisdom A. Jansen Shawn Hershey Tal Remez D. Ellis J. Hershey 28 69 0 02 Nov 2020
Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning L. Tao Xueting Wang T. Yamasaki VLM SSL 23 14 0 29 Oct 2020
Listening to Sounds of Silence for Speech Denoising Ruilin Xu Rundi Wu Y. Ishiwaka Carl Vondrick Changxi Zheng 25 32 0 22 Oct 2020
Contrastive Learning of General-Purpose Audio Representations Aaqib Saeed David Grangier Neil Zeghidour VLM SSL 15 262 0 21 Oct 2020
Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention Bin Duan Hao Tang Wei Wang Ziliang Zong Guowei Yang Yan Yan 30 59 0 14 Aug 2020
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning Ying Cheng Ruize Wang Zhihao Pan Rui Feng Yuejie Zhang SSL 27 106 0 13 Aug 2020
Self-Supervised Learning of Audio-Visual Objects from Video Triantafyllos Afouras Andrew Owens Joon Son Chung Andrew Zisserman SSL 19 250 0 10 Aug 2020
Learning Video Representations from Textual Web Supervision Jonathan C. Stroud Zhichao Lu Chen Sun Jia Deng Rahul Sukthankar Cordelia Schmid David A. Ross SSL 40 48 0 29 Jul 2020
Federated Self-Supervised Learning of Multi-Sensor Representations for Embedded Intelligence Aaqib Saeed Flora D. Salim T. Ozcelebi J. Lukkien FedML SSL 101 98 0 25 Jul 2020
Multiple Sound Sources Localization from Coarse to Fine Rui Qian Di Hu Heinrich Dinkel Mengyue Wu N. Xu Weiyao Lin 28 153 0 13 Jul 2020
Self-Supervised MultiModal Versatile Networks Jean-Baptiste Alayrac Adrià Recasens R. Schneider Relja Arandjelović Jason Ramapuram J. Fauw Lucas Smaira Sander Dieleman Andrew Zisserman SSL 40 371 0 29 Jun 2020
Self-Supervised Graph Transformer on Large-Scale Molecular Data Yu Rong Yatao Bian Tingyang Xu Wei-yang Xie Ying Wei Wenbing Huang Junzhou Huang AI4CE 19 25 0 18 Jun 2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos Andrew Rouditchenko Angie Boggust David Harwath Brian Chen D. Joshi ... Rogerio Feris Brian Kingsbury M. Picheny Antonio Torralba James R. Glass SSL 22 141 0 16 Jun 2020
Telling Left from Right: Learning Spatial Correspondence of Sight and Sound Karren D. Yang Bryan C. Russell Justin Salamon SSL 24 75 0 11 Jun 2020
Visually Guided Sound Source Separation using Cascaded Opponent Filter Network Lingyu Zhu Esa Rahtu 22 23 0 04 Jun 2020
FaceFilter: Audio-visual speech separation using still images Soo-Whan Chung Soyeon Choe Joon Son Chung Hong-Goo Kang CVBM 21 66 0 14 May 2020
VisualEchoes: Spatial Image Representation Learning through Echolocation Ruohan Gao Changan Chen Ziad Al-Halah Carl Schissler Kristen Grauman MDE SSL 171 83 0 04 May 2020
On the Role of Visual Cues in Audiovisual Speech Enhancement Zakaria Aldeneh Anushree Prasanna Kumar B. Theobald Erik Marchi S. Kajarekar Devang Naik Ahmed Hussen Abdelaziz 20 6 0 25 Apr 2020
Conditioned Source Separation for Music Instrument Performances Olga Slizovskaia G. Haro E. Gómez 27 38 0 08 Apr 2020
Speech2Action: Cross-modal Supervision for Action Recognition Arsha Nagrani Chen Sun David A. Ross Rahul Sukthankar Cordelia Schmid Andrew Zisserman 25 54 0 30 Mar 2020
A Metric Learning Reality Check Kevin Musgrave Serge J. Belongie Ser-Nam Lim 36 475 0 18 Mar 2020
Watching the World Go By: Representation Learning from Unlabeled Videos Daniel Gordon Kiana Ehsani D. Fox Ali Farhadi SSL AI4TS 29 87 0 18 Mar 2020
Evolving Losses for Unsupervised Video Representation Learning A. Piergiovanni A. Angelova Michael S. Ryoo SSL 19 138 0 26 Feb 2020
Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition M. Planamente A. Bottino Barbara Caputo EgoV 15 3 0 10 Feb 2020