AudioCLIP: Extending CLIP to Image, Text and Audio

24 June 2021

Papers citing "AudioCLIP: Extending CLIP to Image, Text and Audio"

Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning Sangyeon Cho Jangyeong Jeon Mingi Kim Junyeong Kim CLIP VLM 177 0 0 30 Apr 2025
A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives Shuyu Li Shulei Ji Zihao Wang Songruoyao Wu Jiaxing Yu Kai Zhang MGen VGen 187 1 0 01 Apr 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs Sanjoy Chowdhury Hanan Gani Nishit Anand Sayan Nag Ruohan Gao Mohamed Elhoseiny Salman Khan Dinesh Manocha LRM 120 0 0 29 Mar 2025
Continual Multimodal Contrastive Learning Xiaohao Liu Xiaobo Xia See-Kiong Ng Tat-Seng Chua CLL 167 1 0 19 Mar 2025
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors Ruoxuan Feng Jiangyu Hu Wenke Xia Tianci Gao Ao Shen Yuhao Sun Bin Fang Di Hu 74 6 0 15 Feb 2025
OneLLM: One Framework to Align All Modalities with Language Jiaming Han Kaixiong Gong Yiyuan Zhang Jiaqi Wang Kaipeng Zhang Dahua Lin Yu Qiao Peng Gao Xiangyu Yue MLLM 147 121 0 10 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey Gijs Wijngaard Elia Formisano Michele Esposito M. Dumontier 111 2 0 10 Jan 2025
Adversarial Hubness in Multi-Modal Retrieval Tingwei Zhang Fnu Suya Rishi Jha Collin Zhang Vitaly Shmatikov AAML 123 1 0 18 Dec 2024
Expanding Event Modality Applications through a Robust CLIP-Based Encoder SungHeon Jeong Hanning Chen Sanggeon Yun Suhyeon Cho Wenjun Huang Xiangjian Liu Mohsen Imani 133 2 0 04 Dec 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids Piyush Bagad Makarand Tapaswi Cees G. M. Snoek Andrew Zisserman 118 0 0 18 Nov 2024
Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task H. Haresamudram Chi Ian Tang Sungho Suh P. Lukowicz Thomas Ploetz 118 2 0 11 Nov 2024
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling Jihai Zhang Xiaoye Qu Tong Zhu Yu Cheng 70 8 0 28 Sep 2024
A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection Lam Pham Phat Lam Dat Tran Hieu Tang Tin Nguyen Alexander Schindler Alexander Polonsky Canh Vu 75 4 0 23 Sep 2024
D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching Jingyu Liu Minquan Wang Ye Ma Bo Wang Aozhu Chen Quan Chen Peng Jiang Xirong Li 69 1 0 23 Aug 2024
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing Chunyu Qiang Wang Geng Yi Zhao Ruibo Fu Tao Wang ... Chen Zhang Hao Che L. Wang Jianwu Dang J. Tao AI4TS 62 0 0 11 Aug 2024
Sequential Contrastive Audio-Visual Learning Ioannis Tsiamas Santiago Pascual Chunghsin Yeh Joan Serrà 69 2 0 08 Jul 2024
ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition S. Verbitskiy Vladimir Berikov Viacheslav Vyshegorodtsev 56 73 0 03 Jun 2021
Multimodal Self-Supervised Learning of General Audio Representations Luyu Wang Pauline Luc Adrià Recasens Jean-Baptiste Alayrac Aaron van den Oord SSL 91 41 0 26 Apr 2021
ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio A. Guzhov Federico Raue Jörn Hees Andreas Dengel 55 38 0 23 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text Hassan Akbari Liangzhe Yuan Rui Qian Wei-Hong Chuang Shih-Fu Chang Huayu Chen Boqing Gong ViT 287 581 0 22 Apr 2021
AST: Audio Spectrogram Transformer Yuan Gong Yu-An Chung James R. Glass ViT 97 849 0 05 Apr 2021
MDMMT: Multidomain Multimodal Transformer for Video Retrieval Maksim Dzabraev M. Kalashnikov Stepan Alekseevich Komkov Aleksandr Petiushko 50 128 0 19 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision Alec Radford Jong Wook Kim Chris Hallacy Aditya A. Ramesh Gabriel Goh ... Amanda Askell Pamela Mishkin Jack Clark Gretchen Krueger Ilya Sutskever CLIP VLM 731 28,659 0 26 Feb 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai ... Matthias Minderer G. Heigold Sylvain Gelly Jakob Uszkoreit N. Houlsby ViT 458 40,217 0 22 Oct 2020
Rethinking CNN Models for Audio Classification Kamalesh Palanisamy Dipika Singhania Angela Yao SSL 55 144 0 22 Jul 2020
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition Anurag Kumar V. Ithapu 44 35 0 30 Jun 2020
Self-Supervised MultiModal Versatile Networks Jean-Baptiste Alayrac Adrià Recasens R. Schneider Relja Arandjelović Jason Ramapuram J. Fauw Lucas Smaira Sander Dieleman Andrew Zisserman SSL 113 373 0 29 Jun 2020
ESResNet: Environmental Sound Classification Based on Visual Domain Models A. Guzhov Federico Raue Jörn Hees Andreas Dengel VLM 102 91 0 15 Apr 2020
Zero-Shot Audio Classification Based on Class Label Embeddings Huang Xie Tuomas Virtanen VLM 31 28 0 06 May 2019
Learning from Between-class Examples for Deep Sound Recognition Yuji Tokozume Yoshitaka Ushiku Tatsuya Harada SSL 64 236 0 28 Nov 2017
Attention Is All You Need Ashish Vaswani Noam M. Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan Gomez Lukasz Kaiser Illia Polosukhin 3DV 514 129,831 0 12 Jun 2017
Xception: Deep Learning with Depthwise Separable Convolutions François Chollet MDE BDL PINN 953 14,493 0 07 Oct 2016
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification Justin Salamon J. P. Bello 53 1,300 0 15 Aug 2016
Deep Residual Learning for Image Recognition Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun MedIm 1.6K 192,638 0 10 Dec 2015