ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1910.09387
  4. Cited By
Clotho: An Audio Captioning Dataset

Clotho: An Audio Captioning Dataset

21 October 2019
K. Drossos
Samuel Lipping
Tuomas Virtanen
ArXivPDFHTML

Papers citing "Clotho: An Audio Captioning Dataset"

50 / 259 papers shown
Title
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
Tsung-Han Wu
Joseph E. Gonzalez
Trevor Darrell
David M. Chan
27
2
0
19 Sep 2024
Enhancing Low-Resource Language and Instruction Following Capabilities
  of Audio Language Models
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul
Guangzhi Sun
Warit Sirichotedumrong
Kasima Tharnpipitchai
Kunat Pipatanakul
AuLLM
44
4
0
17 Sep 2024
Language-Queried Target Sound Extraction Without Parallel Training Data
Language-Queried Target Sound Extraction Without Parallel Training Data
Hao Ma
Zhiyuan Peng
Xu Li
Yukai Li
Mingjie Shao
Qiuqiang Kong
Xuelong Li
VLM
80
1
0
14 Sep 2024
Recall: Empowering Multimodal Embedding for Edge Devices
Recall: Empowering Multimodal Embedding for Edge Devices
Dongqi Cai
Shangguang Wang
Chen Peng
Zeling Zhang
Mengwei Xu
29
3
0
09 Sep 2024
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio
  Captioning Performance
EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance
Jaeyeon Kim
Minjeon Jeon
Jaeyoon Jung
Sang Hoon Woo
Jinjoo Lee
34
2
0
02 Sep 2024
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio
  Captioning
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Minjeong Jeon
Sang Hoon Woo
Jinjoo Lee
24
1
0
02 Sep 2024
Dissecting Temporal Understanding in Text-to-Audio Retrieval
Dissecting Temporal Understanding in Text-to-Audio Retrieval
Andreea-Maria Oncescu
João F. Henriques
A. Sophia Koepke
30
2
0
01 Sep 2024
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural
  Language Description
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
Zeyu Jin
Jia Jia
Qixin Wang
Kehan Li
Shuoyi Zhou
Songtao Zhou
Xiaoyu Qin
Zhiyong Wu
36
10
0
24 Aug 2024
On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot
  Learning
On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
Tiago Tavares
Fabio Ayres
Zhepei Wang
Paris Smaragdis
VLM
31
2
0
23 Aug 2024
Estimated Audio-Caption Correspondences Improve Language-Based Audio
  Retrieval
Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval
Paul Primus
Florian Schmid
Gerhard Widmer
37
2
0
21 Aug 2024
Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross
  Modal Retrieval
Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval
Zeyu Chen
Pengfei Zhang
Kai Ye
Wei Dong
Xin Feng
Yana Zhang
45
0
0
28 Jul 2024
Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
Audio Entailment: Assessing Deductive Reasoning for Audio Understanding
Soham Deshmukh
Shuo Han
Hazim T. Bukhari
Benjamin Elizalde
Hannes Gamper
Rita Singh
Bhiksha Raj
ReLM
LRM
AuLLM
35
7
0
25 Jul 2024
Computer Audition: From Task-Specific Machine Learning to Foundation
  Models
Computer Audition: From Task-Specific Machine Learning to Foundation Models
Andreas Triantafyllopoulos
Iosif Tsangko
Alexander Gebhard
A. Mesaros
Tuomas Virtanen
Björn Schuller
47
4
0
22 Jul 2024
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Zehan Wang
Ziang Zhang
Hang Zhang
Luping Liu
Rongjie Huang
Xize Cheng
Hengshuang Zhao
Zhou Zhao
46
9
0
16 Jul 2024
Qwen2-Audio Technical Report
Qwen2-Audio Technical Report
Yunfei Chu
Jin Xu
Qian Yang
Haojie Wei
Xipin Wei
...
Yuanjun Lv
Jinzheng He
Junyang Lin
Chang Zhou
Jingren Zhou
AuLLM
VLM
39
111
0
15 Jul 2024
S3: A Simple Strong Sample-effective Multimodal Dialog System
S3: A Simple Strong Sample-effective Multimodal Dialog System
Elisei Rykov
Egor Malkershin
Alexander Panchenko
28
0
0
26 Jun 2024
Fusing Audio and Metadata Embeddings Improves Language-based Audio
  Retrieval
Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval
Paul Primus
Gerhard Widmer
52
3
0
22 Jun 2024
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and
  Complex Reasoning Abilities
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Sreyan Ghosh
Sonal Kumar
Ashish Seth
Chandra Kiran Reddy Evuru
Utkarsh Tyagi
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM
LRM
46
43
0
17 Jun 2024
Performance Improvement of Language-Queried Audio Source Separation
  Based on Caption Augmentation From Large Language Models for DCASE Challenge
  2024 Task 9
Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
Do Hyun Lee
Yoonah Song
Hong Kook Kim
24
2
0
17 Jun 2024
MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley
  Audio Content Planning and Generation
MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation
Ruibo Fu
Shuchen Shi
Hongming Guo
Tao Wang
Chunyu Qiang
...
Zhiyong Wang
Yukun Liu
Xuefei Liu
Shuai Zhang
Guanjun Li
VGen
30
0
0
15 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM
LRM
49
1
0
13 Jun 2024
Zero-Shot Audio Captioning Using Soft and Hard Prompts
Zero-Shot Audio Captioning Using Soft and Hard Prompts
Yiming Zhang
Xuenan Xu
Ruoyi Du
Haohe Liu
Yuan Dong
Zheng-Hua Tan
Wenwu Wang
Zhanyu Ma
VLM
35
4
0
10 Jun 2024
Soundscape Captioning using Sound Affective Quality Network and Large
  Language Model
Soundscape Captioning using Sound Affective Quality Network and Large Language Model
Yuanbo Hou
Qiaoqiao Ren
A. Mitchell
Wenwu Wang
Jian Kang
Tony Belpaeme
Dick Botteldooren
42
3
0
09 Jun 2024
AudioLCM: Text-to-Audio Generation with Latent Consistency Models
AudioLCM: Text-to-Audio Generation with Latent Consistency Models
Huadai Liu
Rongjie Huang
Yang Liu
Hengyuan Cao
Jialei Wang
Xize Cheng
Siqi Zheng
Zhou Zhao
73
8
0
01 Jun 2024
Listenable Maps for Zero-Shot Audio Classifiers
Listenable Maps for Zero-Shot Audio Classifiers
Francesco Paissan
Luca Della Libera
Mirco Ravanelli
Cem Subakan
40
4
0
27 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to
  Multimodal Inputs
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
71
5
0
26 May 2024
A Survey of Multimodal Large Language Model from A Data-centric
  Perspective
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
60
37
0
26 May 2024
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Yunxin Li
Shenyuan Jiang
Baotian Hu
Longyue Wang
Wanqi Zhong
Wenhan Luo
Lin Ma
Min-Ling Zhang
MoE
46
30
0
18 May 2024
Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
Manh Luong
Khai Nguyen
Nhat Ho
Reza Haf
D.Q. Phung
Lizhen Qu
38
12
0
16 May 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
Zehan Wang
Ziang Zhang
Xize Cheng
Rongjie Huang
Luping Liu
...
Haifeng Huang
Yang Zhao
Tao Jin
Peng Gao
Zhou Zhao
37
9
0
08 May 2024
Distance Sampling-based Paraphraser Leveraging ChatGPT for Text Data
  Manipulation
Distance Sampling-based Paraphraser Leveraging ChatGPT for Text Data Manipulation
Yoori Oh
Yoseob Han
Kyogu Lee
39
1
0
01 May 2024
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
Yiitan Yuan
Zhuo Chen
Xubo Liu
Haohe Liu
Xuenan Xu
Dongya Jia
Yuanzhe Chen
Mark D. Plumbley
Wenwu Wang
CLIP
VLM
40
10
0
27 Apr 2024
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
Shijian Deng
Erin E. Kosloski
Siddhi Patel
Zeke A. Barnett
Yiyang Nan
...
William T. Doan
Matthew Wang
Harsh Singh
P. Rollins
Yapeng Tian
39
4
0
22 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video
  Understanding
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
42
50
0
22 Mar 2024
Building speech corpus with diverse voice characteristics for its
  prompt-based representation
Building speech corpus with diverse voice characteristics for its prompt-based representation
Aya Watanabe
Shinnosuke Takamichi
Yuki Saito
Wataru Nakata
Detai Xin
Hiroshi Saruwatari
40
0
0
20 Mar 2024
Refining Knowledge Transfer on Audio-Image Temporal Agreement for
  Audio-Text Cross Retrieval
Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval
Shunsuke Tsubaki
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Keisuke Imoto
26
1
0
16 Mar 2024
Multiscale Matching Driven by Cross-Modal Similarity Consistency for
  Audio-Text Retrieval
Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval
Qian Wang
Jia-Chen Gu
Zhen-Hua Ling
35
2
0
15 Mar 2024
A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
Xuenan Xu
Xiaohang Xu
Zeyu Xie
Pingyue Zhang
Mengyue Wu
Kai Yu
28
6
0
07 Mar 2024
A SOUND APPROACH: Using Large Language Models to generate audio
  descriptions for egocentric text-audio retrieval
A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval
Andreea-Maria Oncescu
João F. Henriques
Andrew Zisserman
Samuel Albanie
A. Sophia Koepke
35
5
0
29 Feb 2024
Speech Translation with Speech Foundation Models and Large Language
  Models: What is There and What is Missing?
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Marco Gaido
Sara Papi
Matteo Negri
L. Bentivogli
49
13
0
19 Feb 2024
AIR-Bench: Benchmarking Large Audio-Language Models via Generative
  Comprehension
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
Qian Yang
Jin Xu
Wenrui Liu
Yunfei Chu
Ziyue Jiang
...
Yichong Leng
Yuanjun Lv
Zhou Zhao
Chang Zhou
Jingren Zhou
LM&MA
AuLLM
ALM
49
64
0
12 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
46
11
0
10 Feb 2024
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and
  Dialogue Abilities
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities
Zhifeng Kong
Arushi Goel
Rohan Badlani
Ming-Yu Liu
Rafael Valle
Bryan Catanzaro
AuLLM
LM&MA
MLLM
78
75
0
02 Feb 2024
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for
  Automated Audio Captioning
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Jinjoo Lee
Sang Hoon Woo
CLIP
VLM
25
22
0
31 Jan 2024
A Survey on Data Augmentation in Large Model Era
A Survey on Data Augmentation in Large Model Era
Yue Zhou
Chenlu Guo
Xu Wang
Yi-Ju Chang
Yuan Wu
LM&MA
VLM
51
24
0
27 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model
  for Multimodal Processing
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
44
0
0
22 Jan 2024
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal
  Data
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
Yuhui Zhang
Elaine Sui
Serena Yeung-Levy
44
9
0
16 Jan 2024
GroundingGPT:Language Enhanced Multi-modal Grounding Model
GroundingGPT:Language Enhanced Multi-modal Grounding Model
Zhaowei Li
Qi Xu
Dong Zhang
Hang Song
Yiqing Cai
...
Junting Pan
Zefeng Li
Van Tu Vu
Zhida Huang
Tao Wang
38
38
0
11 Jan 2024
Learning Audio Concepts from Counterfactual Natural Language
Learning Audio Concepts from Counterfactual Natural Language
A. Vosoughi
Luca Bondi
Ho-Hsiang Wu
Chenliang Xu
CML
47
3
0
10 Jan 2024
Towards Weakly Supervised Text-to-Audio Grounding
Towards Weakly Supervised Text-to-Audio Grounding
Xuenan Xu
Ziyang Ma
Mengyue Wu
Kai Yu
AI4TS
36
9
0
05 Jan 2024
Previous
123456
Next