Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2211.06687
Cited By
v1
v2
v3
v4 (latest)
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
12 November 2022
Yusong Wu
Kai Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
CLIP
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation"
50 / 383 papers shown
Title
Towards Diverse and Efficient Audio Captioning via Diffusion Models
Manjie Xu
Chenxing Li
Xinyi Tu
Yong Ren
Ruibo Fu
Wei Liang
Dong Yu
DiffM
91
2
0
14 Sep 2024
Language-Queried Target Sound Extraction Without Parallel Training Data
Hao Ma
Zhiyuan Peng
Xu Li
Yukai Li
Mingjie Shao
Qiuqiang Kong
Xuelong Li
VLM
183
2
0
14 Sep 2024
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
Sreyan Ghosh
Sonal Kumar
Chandra Kiran Reddy Evuru
Oriol Nieto
R. Duraiswami
Dinesh Manocha
VLM
119
6
0
13 Sep 2024
Towards Leveraging Contrastively Pretrained Neural Audio Embeddings for Recommender Tasks
Florian Grötschla
Luca Strassle
Luca A. Lanzendörfer
Roger Wattenhofer
63
0
0
13 Sep 2024
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
Zhiqi Huang
Dan Luo
Jun Wang
Huan Liao
Zhiheng Li
Zhiyong Wu
VGen
86
4
0
13 Sep 2024
STA-V2A: Video-to-Audio Generation with Semantic and Temporal Alignment
Yong Ren
Chenxing Li
Manjie Xu
Wei Liang
Yu Gu
Rilin Chen
Dong Yu
VGen
DiffM
99
9
0
13 Sep 2024
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
Yan-Bo Lin
Yu Tian
L. Yang
Gedas Bertasius
Heng Wang
VGen
75
8
0
11 Sep 2024
1M-Deepfakes Detection Challenge
Zhixi Cai
Abhinav Dhall
Shreya Ghosh
Munawar Hayat
D. Kollias
Kalin Stefanov
Usman Tariq
96
2
0
11 Sep 2024
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Qi Yang
Binjie Mao
Zili Wang
Xing Nie
Pengfei Gao
Ying Guo
Cheng Zhen
Pengfei Yan
Shiming Xiang
VGen
DiffM
86
5
0
10 Sep 2024
WaveTransfer: A Flexible End-to-end Multi-instrument Timbre Transfer with Diffusion
Teysir Baoueb
Xiaoyu Bie
Hicham Janati
Gaël Richard
DiffM
41
1
0
06 Sep 2024
Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers
Sohan Anisetty
James Hays
75
0
0
03 Sep 2024
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
Jaeyeon Kim
Minjeon Jeon
Jaeyoon Jung
Sang Hoon Woo
Jinjoo Lee
75
3
0
02 Sep 2024
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Minjeong Jeon
Sang Hoon Woo
Jinjoo Lee
88
1
0
02 Sep 2024
Dissecting Temporal Understanding in Text-to-Audio Retrieval
Andreea-Maria Oncescu
João F. Henriques
A. Sophia Koepke
91
2
0
01 Sep 2024
Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion
Yan Rong
Li Liu
57
5
0
01 Sep 2024
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Zhen Ye
Peiwen Sun
Jiahe Lei
Hongzhan Lin
Xu Tan
...
Jianyi Chen
Jiahao Pan
Qifeng Liu
Yike Guo
Wei Xue
AuLLM
65
19
0
30 Aug 2024
Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation
Kai Chen
Jiaqi Su
Taylor Berg-Kirkpatrick
Shlomo Dubnov
Zeyu Jin
63
2
0
28 Aug 2024
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
Zeyu Jin
Jia Jia
Qixin Wang
Kehan Li
Shuoyi Zhou
Songtao Zhou
Xiaoyu Qin
Zhiyong Wu
100
14
0
24 Aug 2024
On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
Tiago Tavares
Fabio Ayres
Zhepei Wang
Paris Smaragdis
VLM
50
2
0
23 Aug 2024
Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition -- And Ways to Overcome Them
H. Haresamudram
Apoorva Beedu
Mashfiqui Rabbi
Sankalita Saha
Irfan Essa
Thomas Ploetz
100
6
0
21 Aug 2024
Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound
Junwon Lee
Jaekwon Im
Dabin Kim
Juhan Nam
VGen
129
10
0
21 Aug 2024
Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos
Dennis Fedorishin
Lie Lu
S. Setlur
Venu Govindaraju
VGen
101
3
0
20 Aug 2024
Unsupervised Composable Representations for Audio
Giovanni Bindi
P. Esling
DiffM
OCL
CoGe
74
1
0
19 Aug 2024
SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition
Mohamed Osman
Daniel Z. Kaplan
Tamer Nadeem
65
3
0
14 Aug 2024
Music2Latent: Consistency Autoencoders for Latent Audio Compression
Marco Pasini
Stefan Lattner
George Fazekas
67
8
0
12 Aug 2024
TEAdapter: Supply abundant guidance for controllable text-to-music generation
Jialing Zou
Jiahao Mei
Xudong Nan
Jinghua Li
Daoguo Dong
Liang He
60
0
0
09 Aug 2024
Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation
Michael Kolle
Maximilian Zorn
Jongmin Jung
Dasaem Jeong
63
1
0
02 Aug 2024
Combining audio control and style transfer using latent diffusion
Andreas Maier
Yuliya Burankova
Anne Hartebrodt
David B. Blumenthal
DiffM
68
3
0
31 Jul 2024
Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation
Junda Wu
Cheng-i Wang
Amit Namburi
Jiaheng Dai
Hao-Wen Dong
Zhouhang Xie
Carol Chen
Julian McAuley
75
2
0
29 Jul 2024
Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
Soham Deshmukh
Shuo Han
Hazim T. Bukhari
Benjamin Elizalde
Hannes Gamper
Rita Singh
Bhiksha Raj
ReLM
LRM
AuLLM
85
11
0
25 Jul 2024
I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition
Yannis Vasilakis
Rachel M. Bittner
Johan Pauwels
80
1
0
25 Jul 2024
Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment
Seong-Gyun Leem
Daniel Fulford
J. Onnela
David Gard
Carlos Busso
71
1
0
25 Jul 2024
Distortion Recovery: A Two-Stage Method for Guitar Effect Removal
Ying-Shuo Lee
Yueh-Po Peng
Jui-Te Wu
Ming Cheng
Li Su
Yi-Hsuan Yang
67
1
0
23 Jul 2024
Computer Audition: From Task-Specific Machine Learning to Foundation Models
Andreas Triantafyllopoulos
Iosif Tsangko
Alexander Gebhard
A. Mesaros
Tuomas Virtanen
Björn Schuller
94
4
0
22 Jul 2024
Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models
S. Nercessian
Johannes Imort
Ninon Devis
Frederik Blang
73
1
0
22 Jul 2024
DSP-informed bandwidth extension using locally-conditioned excitation and linear time-varying filter subnetworks
S. Nercessian
Alexey Lukin
Johannes Imort
73
0
0
22 Jul 2024
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation
Yun-Han Lan
Wen-Yi Hsiao
Hao-Chung Cheng
Yi-Hsuan Yang
88
9
0
21 Jul 2024
Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio
Roser Batlle-Roca
Wei-Hsiang Liao
Xavier Serra
Yuki Mitsufuji
Emilia Gómez
84
2
0
19 Jul 2024
Stable Audio Open
Zach Evans
Julian Parker
CJ Carr
Zack Zukowski
Josiah Taylor
Jordi Pons
255
53
0
19 Jul 2024
Enhancing Zero-shot Audio Classification using Sound Attribute Knowledge from Large Language Models
Xuenan Xu
Pingyue Zhang
Ming Yan
Ji Zhang
Mengyue Wu
VLM
126
0
0
19 Jul 2024
Efficient Audio Captioning with Encoder-Level Knowledge Distillation
Xuenan Xu
Haohe Liu
Mengyue Wu
Wenwu Wang
Mark D. Plumbley
98
1
0
19 Jul 2024
Audio-visual Generalized Zero-shot Learning the Easy Way
Shentong Mo
Pedro Morgado
61
5
0
18 Jul 2024
Audio Conditioning for Music Generation via Discrete Bottleneck Features
Simon Rouard
Yossi Adi
Jade Copet
Axel Roebel
Alexandre Défossez
MGen
101
1
0
17 Jul 2024
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Zehan Wang
Ziang Zhang
Hang Zhang
Luping Liu
Rongjie Huang
Xize Cheng
Hengshuang Zhao
Zhou Zhao
106
13
0
16 Jul 2024
LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
Zhenxiong Tan
Xinyin Ma
Gongfan Fang
Xinchao Wang
78
3
0
15 Jul 2024
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Zhen Qin
Daoyuan Chen
Wenhao Zhang
Liuyi Yao
Yilun Huang
Bolin Ding
Yaliang Li
Shuiguang Deng
145
7
0
11 Jul 2024
Video-to-Audio Generation with Hidden Alignment
Manjie Xu
Chenxing Li
Yong Ren
Rilin Chen
Yu Gu
Yu Gu
Dong Yu
Dong Yu
DiffM
VGen
111
12
0
10 Jul 2024
Read, Watch and Scream! Sound Generation from Text and Video
Yujin Jeong
Yunji Kim
Sanghyuk Chun
Jiyoung Lee
VGen
DiffM
84
15
0
08 Jul 2024
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serrà
100
2
0
08 Jul 2024
A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining
Feiyang Xiao
Jian Guan
Qiaoxi Zhu
Xubo Liu
Wenbo Wang
Shuhan Qi
Kejia Zhang
Jianyuan Sun
Wenwu Wang
77
7
0
06 Jul 2024
Previous
1
2
3
4
5
6
7
8
Next