Clotho: An Audio Captioning Dataset

21 October 2019
Konstantinos Drossos
Samuel Lipping
Tuomas Virtanen

Papers citing "Clotho: An Audio Captioning Dataset"

50 / 270 papers shown
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Marco Gaido
Sara Papi
Matteo Negri
L. Bentivogli
135
18
0
19 Feb 2024
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
Qian Yang
Jin Xu
Wenrui Liu
Yunfei Chu
Ziyue Jiang
...
Yichong Leng
Yuanjun Lv
Zhou Zhao
Chang Zhou
Jingren Zhou
LM&MA, AuLLM, ALM
113
85
0
12 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
94
12
0
10 Feb 2024
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong
Arushi Goel
Rohan Badlani
Ming-Yu Liu
Rafael Valle
Bryan Catanzaro
AuLLM, LM&MA, MLLM
172
94
0
02 Feb 2024
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Jinjoo Lee
Sang Hoon Woo
CLIP, VLM
72
25
0
31 Jan 2024
A Survey on Data Augmentation in Large Model Era
Yue Zhou
Chenlu Guo
Xu Wang
Yi-Ju Chang
Yuan Wu
LM&MA, VLM
137
27
0
27 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
86
0
0
22 Jan 2024
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
Yuhui Zhang
Elaine Sui
Serena Yeung-Levy
85
10
0
16 Jan 2024
GroundingGPT:Language Enhanced Multi-modal Grounding Model
Zhaowei Li
Qi Xu
Dong Zhang
Hang Song
Yiqing Cai
...
Junting Pan
Zefeng Li
Van Tu Vu
Zhida Huang
Tao Wang
150
44
0
11 Jan 2024
Learning Audio Concepts from Counterfactual Natural Language
Ali Vosoughi
Luca Bondi
Ho-Hsiang Wu
Chenliang Xu
CML
93
5
0
10 Jan 2024
Towards Weakly Supervised Text-to-Audio Grounding
Xuenan Xu
Ziyang Ma
Mengyue Wu
Kai Yu
AI4TS
83
9
0
05 Jan 2024
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
Jinlong Xue
Yayue Deng
Yingming Gao
Ya Li
DiffM
105
36
0
02 Jan 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
222
100
0
29 Dec 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM, MLLM
102
175
0
28 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
92
23
0
27 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
Gabriel Loaiza-Ganem
Anthony L. Caterini
127
3
0
15 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM, MLLM
76
44
0
11 Dec 2023
Speaker-Text Retrieval via Contrastive Learning
Xuechen Liu
Xin Wang
Erica Cooper
Xiaoxiao Miao
Junichi Yamagishi
VLM
45
1
0
11 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM, MLLM
151
61
0
30 Nov 2023
ViT-Lens: Towards Omni-modal Representations
Weixian Lei
Yixiao Ge
Kun Yi
Jianfeng Zhang
Difei Gao
Dylan Sun
Yuying Ge
Ying Shan
Mike Zheng Shou
99
20
0
27 Nov 2023
Zero-shot audio captioning with audio-language model guidance and audio context keywords
Leonard Salewski
Stefan Fauth
A. Sophia Koepke
Zeynep Akata
54
11
0
14 Nov 2023
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu
Jin Xu
Xiaohuan Zhou
Qian Yang
Shiliang Zhang
Zhijie Yan
Chang Zhou
Jingren Zhou
AuLLM
150
351
0
14 Nov 2023
FLAP: Fast Language-Audio Pre-training
Ching-Feng Yeh
Po-Yao Huang
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
CLIP, VLM
74
9
0
02 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
163
44
0
01 Nov 2023
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Changli Tang
Wenyi Yu
Guangzhi Sun
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Chao Zhang
LM&MA, AuLLM
117
264
0
20 Oct 2023
On the Language Encoder of Contrastive Cross-modal Models
Mengjie Zhao
Junya Ono
Zhi-Wei Zhong
Chieh-Hsin Lai
Yuhta Takida
Naoki Murata
Wei-Hsiang Liao
Takashi Shibuya
Hiromi Wakaki
Yuki Mitsufuji
VLM
63
0
0
20 Oct 2023
CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
K. A. Noriy
Xiaosong Yang
Marcin Budka
Jian Jun Zhang
VLM
81
3
0
18 Oct 2023
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
Yimu Wang
Xiangru Jian
Bo Xue
57
11
0
17 Oct 2023
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh
Ashish Seth
Sonal Kumar
Utkarsh Tyagi
Chandra Kiran Reddy Evuru
S. Ramaneswaran
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM, VLM, CoGe
122
26
0
12 Oct 2023
LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
Zhihao Du
Jiaming Wang
Qian Chen
Yunfei Chu
Zhifu Gao
...
Wen Wang
Siqi Zheng
Chang Zhou
Zhijie Yan
Shiliang Zhang
LLMAG, VLM, AuLLM, LM&MA
131
87
0
07 Oct 2023
Prompting Audios Using Acoustic Properties For Emotion Representation
Hira Dhamyal
Benjamin Elizalde
Soham Deshmukh
Huaming Wang
Bhiksha Raj
Rita Singh
58
4
0
03 Oct 2023
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Dongchao Yang
Jinchuan Tian
Xuejiao Tan
Rongjie Huang
Songxiang Liu
...
Jiang Bian
Xixin Wu
Zhou Zhao
Shinji Watanabe
Helen M. Meng
CVBM, AuLLM
155
128
0
01 Oct 2023
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
Shih-Lun Wu
Xuankai Chang
Gordon Wichern
Jee-weon Jung
François G. Germain
Jonathan Le Roux
Shinji Watanabe
83
20
0
29 Sep 2023
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Avamarie Brueggeman
Andrea Madotto
Zhaojiang Lin
Tushar Nagarajan
Matt Smith
...
Peyman Heidari
Yue Liu
Kavya Srinet
Babak Damavandi
Anuj Kumar
MLLM
95
94
0
27 Sep 2023
Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control
Aya Watanabe
Shinnosuke Takamichi
Yuki Saito
Wataru Nakata
Detai Xin
Hiroshi Saruwatari
57
11
0
24 Sep 2023
Weakly-supervised Automated Audio Captioning via text only training
Theodoros Kouzelis
Vassilis Katsouros
CLIP
86
7
0
21 Sep 2023
A Large-scale Dataset for Audio-Language Representation Learning
Luoyi Sun
Xuenan Xu
Mengyue Wu
Weidi Xie
96
27
0
20 Sep 2023
RECAP: Retrieval-Augmented Audio Captioning
Sreyan Ghosh
Sonal Kumar
Chandra Kiran Reddy Evuru
R. Duraiswami
Tianyi Zhou
VLM
100
21
0
18 Sep 2023
Synth-AC: Enhancing Audio Captioning with Synthetic Supervision
Feiyang Xiao
Qiaoxi Zhu
Jian Guan
Xubo Liu
Haohe Liu
Kejia Zhang
Wenwu Wang
66
2
0
18 Sep 2023
Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval
Kaiyi Luo
Xulong Zhang
Jianzong Wang
Huaxiong Li
Ning Cheng
Jing Xiao
113
2
0
16 Sep 2023
Enhance audio generation controllability through representation similarity regularization
Yangyang Shi
Gaël Le Lan
Varun K. Nagaraja
Zhaoheng Ni
Xinhao Mei
Ernie Chang
Forrest N. Iandola
Yang Liu
Vikas Chandra
68
1
0
15 Sep 2023
Audio-free Prompt Tuning for Language-Audio Models
Yiming Li
Xiangdong Wang
Hong Liu
CLIP, VLM
74
10
0
15 Sep 2023
Audio Difference Learning for Audio Captioning
Tatsuya Komatsu
Yusuke Fujita
K. Takeda
Tomoki Toda
80
4
0
15 Sep 2023
Multilingual Audio Captioning using machine translated data
Matéo Cousin
Etienne Labbé
Thomas Pellegrini
106
4
0
14 Sep 2023
Training Audio Captioning Models without Audio
Soham Deshmukh
Benjamin Elizalde
Dimitra Emmanouilidou
Bhiksha Raj
Rita Singh
Huaming Wang
61
20
0
14 Sep 2023
Natural Language Supervision for General-Purpose Audio Representations
Benjamin Elizalde
Soham Deshmukh
Huaming Wang
AuLLM, AI4TS
92
59
0
11 Sep 2023
NExT-GPT: Any-to-Any Multimodal LLM
Shengqiong Wu
Hao Fei
Leigang Qu
Wei Ji
Tat-Seng Chua
MLLM
125
507
0
11 Sep 2023
Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation
A. Sridhar
Yinyi Guo
Erik M. Visser
Rehana Mahfuz
105
5
0
06 Sep 2023
Generating Realistic Images from In-the-wild Sounds
Taegyeong Lee
Jeonghun Kang
Hyeonyu Kim
Taehwan Kim
DiffM
79
3
0
05 Sep 2023
CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding
Etienne Labbé
Thomas Pellegrini
J. Pinquier
86
14
0
01 Sep 2023