arXiv:1910.09387
Clotho: An Audio Captioning Dataset
21 October 2019
Konstantinos Drossos
Samuel Lipping
Tuomas Virtanen
Papers citing "Clotho: An Audio Captioning Dataset"
50 / 269 papers shown
Title
Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training
Jianyuan Feng
Guangzheng Li
Yangfei Xu
VLM
12
0
0
20 Jun 2025
Uncertainty-o: One Model-agnostic Framework for Unveiling Uncertainty in Large Multimodal Models
Ruiyang Zhang
Hu Zhang
Hao Fei
Zhedong Zheng
UQCV
39
0
0
09 Jun 2025
Learning Perceptually Relevant Temporal Envelope Morphing
Satvik Dixit
Sungjoon Park
Chris Donahue
Laurie M. Heller
37
0
0
02 Jun 2025
Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D
Artemis Panagopoulou
Le Xue
Honglu Zhou
Silvio Savarese
Ran Xu
Caiming Xiong
Chris Callison-Burch
Mark Yatskar
Juan Carlos Niebles
52
0
0
02 Jun 2025
The iNaturalist Sounds Dataset
Mustafa Chasmai
Alexander Shepard
Subhransu Maji
Grant Van Horn
42
2
0
31 May 2025
Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions
Jihyoung Jang
Minwook Bae
Minji Kim
Dilek Z. Hakkani-Tür
Hyounghun Kim
25
0
0
31 May 2025
IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling
Kuan Po Huang
Shu-Wen Yang
Huy Phan
Bo-Ru Lu
Byeonggeun Kim
...
Qingming Tang
Shalini Ghosh
Hung-yi Lee
Chieh-Chi Kao
Chao Wang
30
0
0
31 May 2025
Text-Queried Audio Source Separation via Hierarchical Modeling
Xinlei Yin
Xiulian Peng
Xue Jiang
Zhiwei Xiong
Yan Lu
54
0
0
27 May 2025
Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
Junlin Li
Guodong DU
Jing Li
Sim Kuan Goh
Wenya Wang
...
Fangming Liu
Jing Li
Saleh Alharbi
Daojing He
Min Zhang
MoMe
CLL
141
1
0
21 May 2025
Diffused Responsibility: Analyzing the Energy Consumption of Generative Text-to-Audio Diffusion Models
Riccardo Passoni
Francesca Ronchini
Luca Comanducci
Romain Serizel
Fabio Antonacci
DiffM
75
0
0
12 May 2025
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
Paul Primus
Florian Schmid
Gerhard Widmer
CLIP
AI4TS
VLM
58
0
0
12 May 2025
BLAB: Brutally Long Audio Bench
Orevaoghene Ahia
Martijn Bartelds
Kabir Ahuja
Hila Gonen
Valentin Hofmann
...
Noah Bennett
Shinji Watanabe
Noah A. Smith
Yulia Tsvetkov
Sachin Kumar
AuLLM
LM&MA
VLM
129
0
0
05 May 2025
Kimi-Audio Technical Report
KimiTeam
Ding Ding
Zeqian Ju
Yichong Leng
Shixuan Liu
...
Zhiyong Yang
Aoxiong Yin
Ruibin Yuan
Yanzhe Zhang
Zaida Zhou
AuLLM
VLM
185
13
0
25 Apr 2025
Transformation of audio embeddings into interpretable, concept-based representations
Alice Zhang
Edison Thomaz
Lie Lu
78
0
0
18 Apr 2025
Temporal Attention Pooling for Frequency Dynamic Convolution in Sound Event Detection
Hyeonuk Nam
Yong-Hwa Park
56
0
0
17 Apr 2025
ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling
Dongchao Yang
Songxiang Liu
Haohan Guo
Jiankun Zhao
Yuanyuan Wang
...
Xubo Liu
Xueyuan Chen
Xu Tan
Xixin Wu
Helen Meng
233
2
0
14 Apr 2025
Policy Optimization Algorithms in a Unified Framework
Shuang Wu
72
0
0
04 Apr 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM
AuLLM
120
2
0
02 Apr 2025
Continual Cross-Modal Generalization
Yan Xia
Hai Huang
Minghui Fang
Zhou Zhao
CLL
103
0
0
01 Apr 2025
Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context
Junyi Ao
Dekun Chen
Xiaohai Tian
Wenjie Feng
Jing Zhang
Lu Lu
Yansen Wang
Haizhou Li
Zhizheng Wu
AuLLM
116
0
0
19 Mar 2025
Continual Multimodal Contrastive Learning
Xiaohao Liu
Xiaobo Xia
See-Kiong Ng
Tat-Seng Chua
CLL
235
2
0
19 Mar 2025
Mellow: a small audio language model for reasoning
Soham Deshmukh
Satvik Dixit
Rita Singh
Bhiksha Raj
AuLLM
ReLM
LRM
113
4
0
11 Mar 2025
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh
Zhifeng Kong
Sonal Kumar
S. Sakshi
Jaehyeon Kim
Ming-Yu Liu
Rafael Valle
Dinesh Manocha
Bryan Catanzaro
MLLM
AuLLM
LRM
126
21
0
06 Mar 2025
GuardDoor: Safeguarding Against Malicious Diffusion Editing via Protective Backdoors
Yaopei Zeng
Yuanpu Cao
Lu Lin
DiffM
WIGM
108
0
0
05 Mar 2025
Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models
Zhifei Xie
Mingbao Lin
Ziqiang Liu
Pengcheng Wu
Shuicheng Yan
Chunyan Miao
AuLLM
OffRL
LRM
155
17
0
04 Mar 2025
JiTTER: Jigsaw Temporal Transformer for Event Reconstruction for Self-Supervised Sound Event Detection
Hyeonuk Nam
Yong-Hwa Park
80
1
0
28 Feb 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-Wei Chang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
216
4
0
26 Feb 2025
KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation
Yoonjin Chung
Pilsun Eu
Junwon Lee
Keunwoo Choi
Juhan Nam
Ben Sangbae Chon
EGVM
107
4
0
21 Feb 2025
Keep what you need: extracting efficient subnetworks from large audio representation models
David Genova
P. Esling
Tom Hurlin
119
0
0
18 Feb 2025
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
RALM
351
7
0
12 Feb 2025
Audio-Language Models for Audio-Centric Tasks: A survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
AuLLM
164
4
0
28 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
Dahua Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
254
134
0
10 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
187
3
0
10 Jan 2025
Classifier-Guided Captioning Across Modalities
Ariel Shaulov
Tal Shaharabany
E. Shaar
Gal Chechik
Lior Wolf
91
0
0
03 Jan 2025
Language-based Audio Retrieval with Co-Attention Networks
Haoran Sun
Zehua Wang
Qiuyi Chen
Jianjun Chen
Jia Wang
Haiyang Zhang
65
0
0
31 Dec 2024
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing
Inpyo Hong
Youngwan Jo
Hyojeong Lee
Sunghyun Ahn
Sanghyun Park
MQ
122
3
0
26 Dec 2024
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng
Masato Ishii
Akio Hayakawa
Takashi Shibuya
Alex Schwing
Yuki Mitsufuji
VGen
294
18
0
19 Dec 2024
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
Satvik Dixit
Laurie M. Heller
Chris Donahue
VLM
100
7
0
18 Nov 2024
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
Yichen He
Yuan Lin
Jianchao Wu
Hanchong Zhang
Yuchen Zhang
Ruicheng Le
VGen
VLM
321
2
0
11 Nov 2024
MACE: Leveraging Audio for Evaluating Audio Captioning Systems
Satvik Dixit
Soham Deshmukh
Bhiksha Raj
66
1
0
01 Nov 2024
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation
Junwon Lee
Modan Tailleur
Laurie M. Heller
Keunwoo Choi
Mathieu Lagrange
Brian McFee
Keisuke Imoto
Yuki Okamoto
54
5
0
23 Oct 2024
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin
Oh Hyun-Bin
JungMok Lee
Arda Senocak
Joon Son Chung
Tae-Hyun Oh
MLLM
VLM
160
8
0
23 Oct 2024
Do Audio-Language Models Understand Linguistic Variations?
Ramaneswaran Selvakumar
Sonal Kumar
Hemant Kumar Giri
Nishit Anand
Ashish Seth
Sreyan Ghosh
Dinesh Manocha
AuLLM
VLM
147
1
0
21 Oct 2024
Construction and Analysis of Impression Caption Dataset for Environmental Sounds
Yuki Okamoto
Ryotaro Nagase
Minami Okamoto
Yuki Saito
Keisuke Imoto
Takahiro Fukumori
Y. Yamashita
42
0
0
20 Oct 2024
OMCAT: Omni Context Aware Transformer
Arushi Goel
Karan Sapra
Matthieu Le
Rafael Valle
Andrew Tao
Bryan Catanzaro
MLLM
VLM
77
1
0
15 Oct 2024
EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation
Mithun Manivannan
Vignesh Nethrapalli
Mark Cartwright
62
1
0
15 Oct 2024
Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach
Rory Young
Nicolas Pugeault
AAML
136
5
0
14 Oct 2024
SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs
Wenxi Chen
Ziyang Ma
Xiquan Li
Xuenan Xu
Yuzhe Liang
Zhisheng Zheng
Kai Yu
Xie Chen
100
7
0
12 Oct 2024
DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning
Xiquan Li
Wenxi Chen
Ziyang Ma
Xuenan Xu
Yuzhe Liang
Zhisheng Zheng
Qiuqiang Kong
Xie Chen
VLM
121
6
0
12 Oct 2024
The language of sound search: Examining User Queries in Audio Search Engines
Benno Weck
Frederic Font
52
1
0
10 Oct 2024