Clotho: An Audio Captioning Dataset

21 October 2019
Konstantinos Drossos
Samuel Lipping
Tuomas Virtanen

Papers citing "Clotho: An Audio Captioning Dataset"

50 / 269 papers shown
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
Hugo Malard
Michel Olvera
Stéphane Lathuilière
S. Essid
VLM
66
0
0
08 Oct 2024
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
Zixuan Wang
Chi-Keung Tang
DiffM, VGen, LLMAG
109
4
0
04 Oct 2024
Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation
Sen Fang
Sizhou Chen
Yalin Feng
Xiaofeng Zhang
T. Teoh
63
0
0
04 Oct 2024
FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model
Yichen Lu
Jiaqi Song
Chao-Han Huck Yang
Shinji Watanabe
92
0
0
03 Oct 2024
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data
Sreyan Ghosh
Sonal Kumar
Zhifeng Kong
Rafael Valle
Bryan Catanzaro
Dinesh Manocha
DiffM
129
3
0
02 Oct 2024
Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models
Yiming Chen
Xianghu Yue
Xiaoxue Gao
Chen Zhang
L. F. D’Haro
R. Tan
Haizhou Li
AuLLM
146
2
0
27 Sep 2024
Language-based Audio Moment Retrieval
Hokuto Munakata
Taichi Nishimura
Shota Nakada
Tatsuya Komatsu
131
2
0
24 Sep 2024
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Kang Zhu
...
Zhaoxiang Zhang
Zachary Liu
Emmanouil Benetos
Wenhao Huang
Chenghua Lin
LRM
176
19
0
23 Sep 2024
Exploring Text-Queried Sound Event Detection with Audio Source Separation
Han Yin
Jisheng Bai
Yang Xiao
Hui Wang
Siqi Zheng
Yafeng Chen
Rohan Kumar Das
Chong Deng
Jianfeng Chen
106
4
0
20 Sep 2024
Leveraging Audio-Only Data for Text-Queried Target Sound Extraction
Kohei Saijo
Janek Ebbers
François Germain
Sameer Khurana
Gordon Wichern
Jonathan Le Roux
101
1
0
20 Sep 2024
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
Tsung-Han Wu
Joseph E. Gonzalez
Trevor Darrell
David M. Chan
127
2
0
19 Sep 2024
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul
Guangzhi Sun
Warit Sirichotedumrong
Kasima Tharnpipitchai
Kunat Pipatanakul
AuLLM
124
7
0
17 Sep 2024
Language-Queried Target Sound Extraction Without Parallel Training Data
Hao Ma
Zhiyuan Peng
Xu Li
Yukai Li
Mingjie Shao
Qiuqiang Kong
Xuelong Li
VLM
187
2
0
14 Sep 2024
Recall: Empowering Multimodal Embedding for Edge Devices
Dongqi Cai
Shangguang Wang
Chen Peng
Zeling Zhang
Mengwei Xu
61
3
0
09 Sep 2024
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
Jaeyeon Kim
Minjeon Jeon
Jaeyoon Jung
Sang Hoon Woo
Jinjoo Lee
80
3
0
02 Sep 2024
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Minjeong Jeon
Sang Hoon Woo
Jinjoo Lee
88
1
0
02 Sep 2024
Dissecting Temporal Understanding in Text-to-Audio Retrieval
Andreea-Maria Oncescu
João F. Henriques
A. Sophia Koepke
91
2
0
01 Sep 2024
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
Zeyu Jin
Jia Jia
Qixin Wang
Kehan Li
Shuoyi Zhou
Songtao Zhou
Xiaoyu Qin
Zhiyong Wu
100
14
0
24 Aug 2024
On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
Tiago Tavares
Fabio Ayres
Zhepei Wang
Paris Smaragdis
VLM
55
2
0
23 Aug 2024
Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval
Paul Primus
Florian Schmid
Gerhard Widmer
83
2
0
21 Aug 2024
Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval
Zeyu Chen
Pengfei Zhang
Kai Ye
Wei Dong
Xin Feng
Yana Zhang
69
0
0
28 Jul 2024
Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
Soham Deshmukh
Shuo Han
Hazim T. Bukhari
Benjamin Elizalde
Hannes Gamper
Rita Singh
Bhiksha Raj
ReLM, LRM, AuLLM
87
11
0
25 Jul 2024
Computer Audition: From Task-Specific Machine Learning to Foundation Models
Andreas Triantafyllopoulos
Iosif Tsangko
Alexander Gebhard
A. Mesaros
Tuomas Virtanen
Björn Schuller
96
4
0
22 Jul 2024
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Zehan Wang
Ziang Zhang
Hang Zhang
Luping Liu
Rongjie Huang
Xize Cheng
Hengshuang Zhao
Zhou Zhao
106
13
0
16 Jul 2024
Qwen2-Audio Technical Report
Yunfei Chu
Jin Xu
Qian Yang
Haojie Wei
Xipin Wei
...
Yuanjun Lv
Jinzheng He
Junyang Lin
Chang Zhou
Jingren Zhou
AuLLM, VLM
106
161
0
15 Jul 2024
S3: A Simple Strong Sample-effective Multimodal Dialog System
Elisei Rykov
Egor Malkershin
Alexander Panchenko
87
0
0
26 Jun 2024
Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval
Paul Primus
Gerhard Widmer
74
3
0
22 Jun 2024
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Sreyan Ghosh
Sonal Kumar
Ashish Seth
Chandra Kiran Reddy Evuru
Utkarsh Tyagi
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM, LRM
107
61
0
17 Jun 2024
Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
Do Hyun Lee
Yoonah Song
Hong Kook Kim
62
3
0
17 Jun 2024
MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation
Ruibo Fu
Shuchen Shi
Hongming Guo
Tao Wang
Chunyu Qiang
...
Zhiyong Wang
Yukun Liu
Xuefei Liu
Shuai Zhang
Guanjun Li
VGen
45
0
0
15 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM, LRM
82
1
0
13 Jun 2024
Zero-Shot Audio Captioning Using Soft and Hard Prompts
Yiming Zhang
Xuenan Xu
Ruoyi Du
Haohe Liu
Yuan Dong
Zheng-Hua Tan
Wenwu Wang
Zhanyu Ma
VLM
77
4
0
10 Jun 2024
Soundscape Captioning using Sound Affective Quality Network and Large Language Model
Yuanbo Hou
Qiaoqiao Ren
A. Mitchell
Wenwu Wang
Jian Kang
Tony Belpaeme
Dick Botteldooren
103
3
0
09 Jun 2024
AudioLCM: Text-to-Audio Generation with Latent Consistency Models
Huadai Liu
Rongjie Huang
Yang Liu
Hengyuan Cao
Jialei Wang
Xize Cheng
Siqi Zheng
Zhou Zhao
127
10
0
01 Jun 2024
Listenable Maps for Zero-Shot Audio Classifiers
Francesco Paissan
Luca Della Libera
Mirco Ravanelli
Cem Subakan
107
4
0
27 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
141
5
0
26 May 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
137
45
0
26 May 2024
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Yunxin Li
Shenyuan Jiang
Baotian Hu
Longyue Wang
Wanqi Zhong
Wenhan Luo
Lin Ma
Min Zhang
MoE
111
42
0
18 May 2024
Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
Manh Luong
Khai Nguyen
Nhat Ho
Reza Haf
D.Q. Phung
Lizhen Qu
75
13
0
16 May 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
Zehan Wang
Ziang Zhang
Xize Cheng
Rongjie Huang
Luping Liu
...
Haifeng Huang
Yang Zhao
Tao Jin
Peng Gao
Zhou Zhao
76
10
0
08 May 2024
Distance Sampling-based Paraphraser Leveraging ChatGPT for Text Data Manipulation
Yoori Oh
Yoseob Han
Kyogu Lee
78
1
0
01 May 2024
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
Yi Yuan
Zhuo Chen
Xubo Liu
Haohe Liu
Xuenan Xu
Dongya Jia
Yuanzhe Chen
Mark D. Plumbley
Wenwu Wang
CLIP, VLM
71
12
0
27 Apr 2024
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
Shijian Deng
Erin E. Kosloski
Siddhi Patel
Zeke A. Barnett
Yiyang Nan
...
William T. Doan
Matthew Wang
Harsh Singh
P. Rollins
Yapeng Tian
66
5
0
22 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
88
79
0
22 Mar 2024
Building speech corpus with diverse voice characteristics for its prompt-based representation
Aya Watanabe
Shinnosuke Takamichi
Yuki Saito
Wataru Nakata
Detai Xin
Hiroshi Saruwatari
65
1
0
20 Mar 2024
Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval
Shunsuke Tsubaki
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Keisuke Imoto
67
1
0
16 Mar 2024
Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval
Qian Wang
Jia-Chen Gu
Zhen-Hua Ling
61
2
0
15 Mar 2024
A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
Xuenan Xu
Xiaohang Xu
Zeyu Xie
Pingyue Zhang
Mengyue Wu
Kai Yu
66
6
0
07 Mar 2024
A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval
Andreea-Maria Oncescu
João F. Henriques
Andrew Zisserman
Samuel Albanie
A. Sophia Koepke
72
5
0
29 Feb 2024
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Marco Gaido
Sara Papi
Matteo Negri
L. Bentivogli
133
18
0
19 Feb 2024