Objects that Sound

18 December 2017

Papers citing "Objects that Sound"

50 / 134 papers shown

Title
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment Edson Araujo Andrew Rouditchenko Yuan Gong Saurabhchand Bhati Samuel Thomas Brian Kingsbury Leonid Karlinsky Rogerio Feris James Glass 41 0 0 02 May 2025
Improving Sound Source Localization with Joint Slot Attention on Image and Audio Inho Kim Youngkil Song Jicheol Park Won Hwa Kim Suha Kwak 22 0 0 21 Apr 2025
The Sound of Water: Inferring Physical Properties from Pouring Liquids Piyush Bagad Makarand Tapaswi Cees G. M. Snoek Andrew Zisserman 45 0 0 18 Nov 2024
Towards Open-Vocabulary Audio-Visual Event Localization Jinxing Zhou Dan Guo Ruohao Guo Yuxin Mao Jingjing Hu Yiran Zhong Xiaojun Chang Hao Wu VLM 58 4 0 18 Nov 2024
A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio Xavier Juanola Gloria Haro Magdalena Fuentes 31 2 0 01 Oct 2024
Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment Arda Senocak H. Ryu Junsik Kim Tae-Hyun Oh Hanspeter Pfister Joon Son Chung 38 3 0 18 Jul 2024
Sequential Contrastive Audio-Visual Learning Ioannis Tsiamas Santiago Pascual Chunghsin Yeh Joan Serrà 41 2 0 08 Jul 2024
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation Se-eun Yoon Hyunsik Jeon Julian McAuley 40 0 0 23 May 2024
Images that Sound: Composing Images and Sounds on a Single Canvas Ziyang Chen Daniel Geng Andrew Owens DiffM 50 9 0 20 May 2024
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering Charig Yang Weidi Xie Andrew Zisserman 34 1 0 25 Apr 2024
Understanding Hyperbolic Metric Learning through Hard Negative Sampling Yun Yue Fangzhou Lin Guanyi Mou Ziming Zhang SSL 30 1 0 23 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners Yan-Bo Lin Gedas Bertasius 37 5 0 28 Mar 2024
Synchformer: Efficient Synchronization from Sparse Cues Vladimir E. Iashin Weidi Xie Esa Rahtu Andrew Zisserman 24 11 0 29 Jan 2024
A-JEPA: Joint-Embedding Predictive Architecture Can Listen Zhengcong Fei Mingyuan Fan Junshi Huang 25 17 0 27 Nov 2023
OmniVec: Learning robust representations with cross modal sharing Siddharth Srivastava Gaurav Sharma SSL 27 64 0 07 Nov 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA Asmar Nadeem Adrian Hilton R. Dawes Graham A. Thomas A. Mustafa 23 9 0 25 Oct 2023
Multimodal Variational Auto-encoder based Audio-Visual Segmentation Yuxin Mao Jing Zhang Mochu Xiang Yiran Zhong Yuchao Dai 37 34 0 12 Oct 2023
Deep Video Inpainting Guided by Audio-Visual Self-Supervision Kyuyeon Kim Junsik Jung Woo Jae Kim Sung-eui Yoon SSL 28 1 0 11 Oct 2023
Cross-modal Cognitive Consensus guided Audio-Visual Segmentation Zhaofeng Shi Qingbo Wu Fanman Meng Linfeng Xu Hongliang Li VOS 30 3 0 10 Oct 2023
Sound Source Localization is All about Cross-Modal Alignment Arda Senocak H. Ryu Junsik Kim Tae-Hyun Oh Hanspeter Pfister Joon Son Chung 36 18 0 19 Sep 2023
A Multimodal Prototypical Approach for Unsupervised Sound Classification Saksham Singh Kushwaha Magdalena Fuentes 22 8 0 21 Jun 2023
Transavs: End-To-End Audio-Visual Segmentation With Transformer Yuhang Ling Yuxi Li Zhenye Gan Jiangning Zhang M. Chi Yabiao Wang VOS ViT 34 1 0 12 May 2023
Noisy Correspondence Learning with Meta Similarity Correction Haocheng Han Kaiyao Miao Qinghua Zheng Minnan Luo 32 28 0 13 Apr 2023
Egocentric Auditory Attention Localization in Conversations Fiona Ryan Hao Jiang Abhinav Shukla James M. Rehg V. Ithapu EgoV 29 16 0 28 Mar 2023
LipLearner: Customizable Silent Speech Interactions on Mobile Devices Zixiong Su Shitao Fang Jun Rekimoto 18 26 0 12 Feb 2023
LoCoNet: Long-Short Context Network for Active Speaker Detection Xizi Wang Feng Cheng Gedas Bertasius David J. Crandall 26 15 0 19 Jan 2023
Look, Listen, and Attack: Backdoor Attacks Against Video Action Recognition Hasan Hammoud Shuming Liu Mohammad Alkhrashi Fahad Albalawi Guohao Li AAML 32 8 0 03 Jan 2023
CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos Hao-Wen Dong Naoya Takahashi Yuki Mitsufuji Julian McAuley Taylor Berg-Kirkpatrick VLM CLIP 28 25 0 14 Dec 2022
Motion and Context-Aware Audio-Visual Conditioned Video Prediction Yating Xu Conghui Hu G. Lee VGen 40 0 0 09 Dec 2022
Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection Rahul Sharma Shrikanth Narayanan 37 8 0 01 Dec 2022
Mix and Localize: Localizing Sound Sources in Mixtures Xixi Hu Ziyang Chen Andrew Owens 23 51 0 28 Nov 2022
Unifying Tracking and Image-Video Object Detection Peirong Liu Rui Wang Pengchuan Zhang Omid Poursaeed Yipin Zhou Xuefei Cao Sreya . Dutta Roy Ashish Shah Ser-Nam Lim 18 0 0 20 Nov 2022
Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization Yuanyuan Jiang Jianqin Yin Yonghao Dang 35 5 0 11 Oct 2022
Contrastive Audio-Visual Masked Autoencoder Yuan Gong Andrew Rouditchenko Alexander H. Liu David Harwath Leonid Karlinsky Hilde Kuehne James R. Glass 35 120 0 02 Oct 2022
Learning State-Aware Visual Representations from Audible Interactions Himangi Mittal Pedro Morgado Unnat Jain Abhinav Gupta 78 23 0 27 Sep 2022
Unsupervised active speaker detection in media content using cross-modal information Rahul Sharma Shrikanth Narayanan 14 3 0 24 Sep 2022
Learning in Audio-visual Context: A Review, Analysis, and New Perspective Yake Wei Di Hu Yapeng Tian Xuelong Li 46 55 0 20 Aug 2022
Impact Makes a Sound and Sound Makes an Impact: Sound Guides Representations and Explorations Xufeng Zhao C. Weber Muhammad Burhan Hafez S. Wermter 18 8 0 04 Aug 2022
AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation Efthymios Tzinis Scott Wisdom Tal Remez J. Hershey 39 29 0 20 Jul 2022
Is an Object-Centric Video Representation Beneficial for Transfer? Chuhan Zhang Ankush Gupta Andrew Zisserman ViT 37 27 0 20 Jul 2022
Temporal and cross-modal attention for audio-visual zero-shot learning Otniel-Bogdan Mercea Thomas Hummel A. Sophia Koepke Zeynep Akata 38 25 0 20 Jul 2022
SVGraph: Learning Semantic Graphs from Instructional Videos Madeline Chantry Schiappa Y. S. Rawat 17 4 0 16 Jul 2022
Masked Autoencoders that Listen Po-Yao (Bernie) Huang Hu Xu Juncheng Billy Li Alexei Baevski Michael Auli Wojciech Galuba Florian Metze Christoph Feichtenhofer 18 268 0 13 Jul 2022
Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection Jiashuo Yu Jin-Yuan Liu Ying Cheng Rui Feng Yuejie Zhang 19 34 0 12 Jul 2022
Audio-Visual Segmentation Jinxing Zhou Jianyuan Wang Jingyang Zhang Weixuan Sun Jing Zhang Stan Birchfield Dan Guo Lingpeng Kong Meng Wang Yiran Zhong VOS 33 110 0 11 Jul 2022
Finding Fallen Objects Via Asynchronous Audio-Visual Integration Chuang Gan Yi Gu Siyuan Zhou Jeremy Schwartz S. Alter James Traer Dan Gutfreund J. Tenenbaum Josh H. McDermott Antonio Torralba 47 19 0 07 Jul 2022
Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization Jiashuo Yu Junfu Pu Ying Cheng Rui Feng Ying Shan 21 5 0 07 Jul 2022
Visual-Assisted Sound Source Depth Estimation in the Wild Wei Sun L. Qiu MDE 13 0 0 07 Jul 2022
A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! Chenglizhao Chen Mengke Song Wenfeng Song Li Guo Muwei Jian 35 25 0 20 Jun 2022
Self-Supervised Learning for Videos: A Survey Madeline Chantry Schiappa Y. S. Rawat M. Shah SSL 36 131 0 18 Jun 2022