Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2106.13043
Cited By
AudioCLIP: Extending CLIP to Image, Text and Audio
24 June 2021
A. Guzhov
Federico Raue
Jörn Hees
Andreas Dengel
CLIP
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"AudioCLIP: Extending CLIP to Image, Text and Audio"
34 / 34 papers shown
Title
Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
Sangyeon Cho
Jangyeong Jeon
Mingi Kim
Junyeong Kim
CLIP
VLM
177
0
0
30 Apr 2025
A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives
Shuyu Li
Shulei Ji
Zihao Wang
Songruoyao Wu
Jiaxing Yu
Kai Zhang
MGen
VGen
187
1
0
01 Apr 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
120
0
0
29 Mar 2025
Continual Multimodal Contrastive Learning
Xiaohao Liu
Xiaobo Xia
See-Kiong Ng
Tat-Seng Chua
CLL
167
1
0
19 Mar 2025
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
Ruoxuan Feng
Jiangyu Hu
Wenke Xia
Tianci Gao
Ao Shen
Yuhao Sun
Bin Fang
Di Hu
74
6
0
15 Feb 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
Dahua Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
147
121
0
10 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
111
2
0
10 Jan 2025
Adversarial Hubness in Multi-Modal Retrieval
Tingwei Zhang
Fnu Suya
Rishi Jha
Collin Zhang
Vitaly Shmatikov
AAML
123
1
0
18 Dec 2024
Expanding Event Modality Applications through a Robust CLIP-Based Encoder
SungHeon Jeong
Hanning Chen
Sanggeon Yun
Suhyeon Cho
Wenjun Huang
Xiangjian Liu
Mohsen Imani
133
2
0
04 Dec 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
118
0
0
18 Nov 2024
Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task
H. Haresamudram
Chi Ian Tang
Sungho Suh
P. Lukowicz
Thomas Ploetz
118
2
0
11 Nov 2024
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Jihai Zhang
Xiaoye Qu
Tong Zhu
Yu Cheng
70
8
0
28 Sep 2024
A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection
Lam Pham
Phat Lam
Dat Tran
Hieu Tang
Tin Nguyen
Alexander Schindler
Canh Vu
Alexander Polonsky
Canh Vu
75
4
0
23 Sep 2024
D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching
Jingyu Liu
Minquan Wang
Ye Ma
Bo Wang
Aozhu Chen
Quan Chen
Peng Jiang
Xirong Li
69
1
0
23 Aug 2024
VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing
Chunyu Qiang
Wang Geng
Yi Zhao
Ruibo Fu
Tao Wang
...
Chen Zhang
Hao Che
L. Wang
Jianwu Dang
J. Tao
AI4TS
62
0
0
11 Aug 2024
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serrà
69
2
0
08 Jul 2024
ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition
S. Verbitskiy
Vladimir Berikov
Viacheslav Vyshegorodtsev
56
73
0
03 Jun 2021
Multimodal Self-Supervised Learning of General Audio Representations
Luyu Wang
Pauline Luc
Adrià Recasens
Jean-Baptiste Alayrac
Aaron van den Oord
SSL
91
41
0
26 Apr 2021
ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio
A. Guzhov
Federico Raue
Jörn Hees
Andreas Dengel
55
38
0
23 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Huayu Chen
Boqing Gong
ViT
287
581
0
22 Apr 2021
AST: Audio Spectrogram Transformer
Yuan Gong
Yu-An Chung
James R. Glass
ViT
97
849
0
05 Apr 2021
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
Maksim Dzabraev
M. Kalashnikov
Stepan Alekseevich Komkov
Aleksandr Petiushko
50
128
0
19 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
731
28,659
0
26 Feb 2021
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
458
40,217
0
22 Oct 2020
Rethinking CNN Models for Audio Classification
Kamalesh Palanisamy
Dipika Singhania
Angela Yao
SSL
55
144
0
22 Jul 2020
A Sequential Self Teaching Approach for Improving Generalization in Sound Event Recognition
Anurag Kumar
V. Ithapu
44
35
0
30 Jun 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
113
373
0
29 Jun 2020
ESResNet: Environmental Sound Classification Based on Visual Domain Models
A. Guzhov
Federico Raue
Jörn Hees
Andreas Dengel
VLM
102
91
0
15 Apr 2020
Zero-Shot Audio Classification Based on Class Label Embeddings
Huang Xie
Tuomas Virtanen
VLM
31
28
0
06 May 2019
Learning from Between-class Examples for Deep Sound Recognition
Yuji Tokozume
Yoshitaka Ushiku
Tatsuya Harada
SSL
64
236
0
28 Nov 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
514
129,831
0
12 Jun 2017
Xception: Deep Learning with Depthwise Separable Convolutions
François Chollet
MDE
BDL
PINN
953
14,493
0
07 Oct 2016
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
Justin Salamon
J. P. Bello
53
1,300
0
15 Aug 2016
Deep Residual Learning for Image Recognition
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
MedIm
1.6K
192,638
0
10 Dec 2015
1