Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2211.06687
Cited By
v1
v2
v3
v4 (latest)
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
12 November 2022
Yusong Wu
Kai Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
CLIP
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation"
50 / 383 papers shown
Title
MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation
Zihao Wang
Haoxuan Liu
Jiaxing Yu
Tao Zhang
Yan Liu
Kai Zhang
137
1
0
03 Jul 2024
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Zeyu Xie
Xuenan Xu
Zhizheng Wu
Mengyue Wu
75
8
0
03 Jul 2024
AudioTime: A Temporally-aligned Audio-text Benchmark Dataset
Zeyu Xie
Xuenan Xu
Zhizheng Wu
Mengyue Wu
AuLLM
104
6
0
03 Jul 2024
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond
Marco Comunità
Zhi-Wei Zhong
Akira Takahashi
Shiqi Yang
Mengjie Zhao
Koichi Saito
Yukara Ikemiya
Takashi Shibuya
Shusuke Takahashi
Yuki Mitsufuji
110
6
0
25 Jun 2024
Exploring compressibility of transformer based text-to-music (TTM) models
Vasileios Moschopoulos
Thanasis Kotsiopoulos
Pablo Peso Parada
Konstantinos Nikiforidis
Alexandros Stergiadis
Gerasimos Papakostas
Md. Asif Jalal
Jisi Zhang
Anastasios Drosou
Karthikeyan P. Saravanan
47
0
0
24 Jun 2024
AND: Audio Network Dissection for Interpreting Deep Acoustic Models
Tung-Yu Wu
Yu-Xiang Lin
Tsui-Wei Weng
108
2
0
24 Jun 2024
Improving Text-To-Audio Models with Synthetic Captions
Zhifeng Kong
Sang-gil Lee
Deepanway Ghosal
Navonil Majumder
Ambuj Mehrish
Rafael Valle
Soujanya Poria
Bryan Catanzaro
110
13
0
18 Jun 2024
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Sreyan Ghosh
Sonal Kumar
Ashish Seth
Chandra Kiran Reddy Evuru
Utkarsh Tyagi
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM
LRM
105
61
0
17 Jun 2024
Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation
Or Tal
Alon Ziv
Itai Gat
Felix Kreuk
Yossi Adi
86
17
0
16 Jun 2024
MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation
Ruibo Fu
Shuchen Shi
Hongming Guo
Tao Wang
Chunyu Qiang
...
Zhiyong Wang
Yukun Liu
Xuefei Liu
Shuai Zhang
Guanjun Li
VGen
42
0
0
15 Jun 2024
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Changan Chen
Puyuan Peng
Ami Baid
Zihui Xue
Wei-Ning Hsu
David Harwath
Kristen Grauman
VGen
82
8
0
13 Jun 2024
Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan
Heinrich Dinkel
Yongqing Wang
Jizhong Liu
Junbo Zhang
Yujun Wang
Bin Wang
VLM
76
5
0
11 Jun 2024
MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation
Lu Li
Tianze Zhang
Zhiqi Bu
Suyuchen Wang
Huan He
Jie Fu
Yonghui Wu
Jiang Bian
Yong Chen
Yoshua Bengio
FedML
MoMe
119
6
0
11 Jun 2024
BTS: Bridging Text and Sound Modalities for Metadata-Aided Respiratory Sound Classification
June-Woo Kim
Miika Toikkanen
Yera Choi
Seoung-Eun Moon
Ho-Young Jung
93
9
0
10 Jun 2024
Zero-Shot Audio Captioning Using Soft and Hard Prompts
Yiming Zhang
Xuenan Xu
Ruoyi Du
Haohe Liu
Yuan Dong
Zheng-Hua Tan
Wenwu Wang
Zhanyu Ma
VLM
72
4
0
10 Jun 2024
VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text
Tianyu Zhang
Suyuchen Wang
Lu Li
Ge Zhang
Perouz Taslakian
Sai Rajeswar
Jie Fu
Bang Liu
Yoshua Bengio
105
5
0
10 Jun 2024
Zero-Shot End-To-End Spoken Question Answering In Medical Domain
Yanis Labrak
Adel Moumen
Richard Dufour
Mickael Rouvier
ELM
LM&MA
MedIm
74
1
0
09 Jun 2024
Contrastive Learning from Synthetic Audio Doppelgängers
Manuel Cherep
Nikhil Singh
93
1
0
09 Jun 2024
Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
Jinlong Xue
Yayue Deng
Yingming Gao
Ya Li
RALM
VLM
136
7
0
06 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Xu Tan
Qifeng Chen
Yu Guo
VGen
271
17
0
06 Jun 2024
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Mehmet Hamza Erol
Arda Senocak
Jiu Feng
Joon Son Chung
Mamba
136
25
0
05 Jun 2024
Operational Latent Spaces
Scott H. Hawley
Austin R. Tackett
51
0
0
04 Jun 2024
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Masahiro Yasuda
Shunsuke Tsubaki
Keisuke Imoto
VLM
83
7
0
04 Jun 2024
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
Jinxing Zhou
Dan Guo
Yiran Zhong
Meng Wang
VLM
83
19
0
03 Jun 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
Yongqi Wang
Wenxiang Guo
Rongjie Huang
Jia-Bin Huang
Zehan Wang
Fuming You
Ruiqi Li
Zhou Zhao
VGen
DiffM
130
13
0
01 Jun 2024
Creative Text-to-Audio Generation via Synthesizer Programming
Manuel Cherep
Nikhil Singh
Jessica Shand
71
4
0
01 Jun 2024
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Cheng-i Wang
Julian McAuley
Taylor Berg-Kirkpatrick
Nicholas J. Bryan
115
12
0
30 May 2024
Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI
Che Liu
Changde Du
Xiaoyu Chen
Huiguang He
72
2
0
29 May 2024
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning
Yixiao Zhang
Yukara Ikemiya
Woosung Choi
Naoki Murata
Marco A. Martínez-Ramírez
Liwei Lin
Gus Xia
Wei-Hsiang Liao
Yuki Mitsufuji
Simon Dixon
103
12
0
28 May 2024
Listenable Maps for Zero-Shot Audio Classifiers
Francesco Paissan
Luca Della Libera
Mirco Ravanelli
Cem Subakan
105
4
0
27 May 2024
C3LLM: Conditional Multimodal Content Generation Using Large Language Models
Zixuan Wang
Qinkai Duan
Yu-Wing Tai
Chi-Keung Tang
109
3
0
25 May 2024
SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation
Xinlei Niu
Jing Zhang
Christian J. Walder
Charles Patrick Martin
52
2
0
24 May 2024
Quality-aware Masked Diffusion Transformer for Enhanced Music Generation
Chang Li
Ruoyu Wang
Lijuan Liu
Jun Du
Yixuan Sun
Zilu Guo
Zhenrong Zhang
Yuan Jiang
J. Gao
Feng Ma
119
5
0
24 May 2024
Images that Sound: Composing Images and Sounds on a Single Canvas
Ziyang Chen
Daniel Geng
Andrew Owens
DiffM
161
9
0
20 May 2024
CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing
Faegheh Sardari
A. Mustafa
Philip J. B. Jackson
Adrian Hilton
99
4
0
17 May 2024
Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
Manh Luong
Khai Nguyen
Nhat Ho
Reza Haf
D.Q. Phung
Lizhen Qu
65
13
0
16 May 2024
Dance Any Beat: Blending Beats with Visuals in Dance Video Generation
Xuanchen Wang
Heng Wang
Dongnan Liu
Weidong Cai
84
5
0
15 May 2024
Naturalistic Music Decoding from EEG Data via Latent Diffusion Models
Emilian Postolache
Natalia Polouliakh
Hiroaki Kitano
Akima Connelly
Emanuele Rodolà
Luca Cosmo
Taketo Akama
MedIm
DiffM
100
4
0
15 May 2024
FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation
Jianyi Chen
Wei Xue
Xu Tan
Zhen Ye
Qi-fei Liu
Yi-Ting Guo
66
2
0
13 May 2024
SonifyAR: Context-Aware Sound Generation in Augmented Reality
Xia Su
Jon E. Froehlich
Eunyee Koh
Chang Xiao
51
3
0
11 May 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
Zehan Wang
Ziang Zhang
Xize Cheng
Rongjie Huang
Luping Liu
...
Haifeng Huang
Yang Zhao
Tao Jin
Peng Gao
Zhou Zhao
76
10
0
08 May 2024
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
Alessandro Pianese
D. Cozzolino
Giovanni Poggi
L. Verdoliva
89
6
0
03 May 2024
Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition
Dongyuan Li
Ying Zhang
Yusong Wang
Funakoshi Kataro
Manabu Okumura
66
1
0
01 May 2024
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
Yiitan Yuan
Zhuo Chen
Xubo Liu
Haohe Liu
Xuenan Xu
Dongya Jia
Yuanzhe Chen
Mark D. Plumbley
Wenwu Wang
CLIP
VLM
69
12
0
27 Apr 2024
HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts
Xinlei Niu
Jing Zhang
Charles Patrick Martin
56
3
0
24 Apr 2024
Music Consistency Models
Zhengcong Fei
Mingyuan Fan
Junshi Huang
DiffM
97
5
0
20 Apr 2024
Track Role Prediction of Single-Instrumental Sequences
Changheon Han
Suhyun Lee
Minsam Ko
49
0
0
20 Apr 2024
GEOBIND: Binding Text, Image, and Audio through Satellite Images
Aayush Dhakal
Subash Khanal
Srikumar Sastry
Adeel Ahmad
Nathan Jacobs
73
3
0
17 Apr 2024
Long-form music generation with latent diffusion
Zach Evans
Julian Parker
CJ Carr
Zack Zukowski
Josiah Taylor
Jordi Pons
MGen
DiffM
122
45
0
16 Apr 2024
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
Navonil Majumder
Chia-Yu Hung
Deepanway Ghosal
Wei-Ning Hsu
Rada Mihalcea
Soujanya Poria
143
61
0
15 Apr 2024
Previous
1
2
3
4
5
6
7
8
Next