Clotho: An Audio Captioning Dataset

21 October 2019
Konstantinos Drossos
Samuel Lipping
Tuomas Virtanen

Papers citing "Clotho: An Audio Captioning Dataset"

50 / 259 papers shown
Title
IteraTTA: An interface for exploring both text prompts and audio priors in generating music with text-to-audio models
Hiromu Yakura
Masataka Goto
29
2
0
24 Jul 2023
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Yang Zhao
Zhijie Lin
Daquan Zhou
Zilong Huang
Jiashi Feng
Bingyi Kang
MLLM
44
108
0
17 Jul 2023
A Demand-Driven Perspective on Generative Audio AI
Sangshin Oh
Minsung Kang
Hyeongi Moon
Keunwoo Choi
Ben Sangbae Chon
33
3
0
10 Jul 2023
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Xu
Enhong Chen
MLLM
LRM
62
562
0
23 Jun 2023
Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation
Qianji Di
Wenxing Ma
Zhongang Qi
Tianxiang Hou
Ying Shan
Hanzi Wang
27
0
0
23 Jun 2023
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
36
2
0
21 Jun 2023
Improving Audio Caption Fluency with Automatic Error Correction
Hanxue Zhang
Zeyu Xie
Xuenan Xu
Mengyue Wu
K. Yu
26
0
0
16 Jun 2023
Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances
Huang Xie
Khazar Khorrami
Okko Rasanen
Tuomas Virtanen
24
4
0
16 Jun 2023
FALL-E: A Foley Sound Synthesis Model and Strategies
Minsung Kang
Sangshin Oh
Hyeongi Moon
Kyungyun Lee
Ben Sangbae Chon
28
4
0
16 Jun 2023
Enhance Temporal Relations in Audio Captioning with Sound Event Detection
Zeyu Xie
Xuenan Xu
Mengyue Wu
K. Yu
31
10
0
02 Jun 2023
Adapting a ConvNeXt model to audio classification on AudioSet
Thomas Pellegrini
Ismail Khalfaoui-Hassani
Etienne Labbé
T. Masquelier
14
22
0
01 Jun 2023
Attention-Based Methods For Audio Question Answering
Parthasaarathy Sudarsanam
Tuomas Virtanen
25
2
0
31 May 2023
Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning
Jianyuan Sun
Xubo Liu
Xinhao Mei
V. Kılıç
Mark D. Plumbley
Wenwu Wang
33
3
0
30 May 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Jiaheng Liu
42
97
0
29 May 2023
Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation
Jia-Bin Huang
Yi Ren
Rongjie Huang
Dongchao Yang
Zhenhui Ye
Chen Zhang
Jinglin Liu
Xiang Yin
Zejun Ma
Zhou Zhao
DiffM
37
59
0
29 May 2023
Multi-Scale Attention for Audio Question Answering
Guangyao Li
Yixin Xu
Di Hu
30
16
0
29 May 2023
CAPTDURE: Captioned Sound Dataset of Single Sources
Yuki Okamoto
Kanta Shimonishi
Keisuke Imoto
Kota Dohi
Shota Horiguchi
Y. Kawaguchi
32
1
0
28 May 2023
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Zijia Zhao
Longteng Guo
Tongtian Yue
Si-Qing Chen
Shuai Shao
Xinxin Zhu
Zehuan Yuan
Jing Liu
MLLM
40
53
0
25 May 2023
Connecting Multi-modal Contrastive Representations
Zehan Wang
Yang Zhao
Xize Cheng
Haifeng Huang
Jiageng Liu
...
Lin Li
Yongqiang Wang
Aoxiong Yin
Ziang Zhang
Zhou Zhao
30
22
0
22 May 2023
Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
MLLM
AuLLM
45
161
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
53
116
0
18 May 2023
Listen, Think, and Understand
Yuan Gong
Hongyin Luo
Alexander H. Liu
Leonid Karlinsky
James R. Glass
ELM
MLLM
LRM
43
141
0
18 May 2023
A Whisper transformer for audio captioning trained with synthetic captions and transfer learning
Marek Kadlcík
Adam Hájek
Jürgen Kieslich
Radoslaw Winiecki
VLM
8
11
0
15 May 2023
Diverse and Vivid Sound Generation from Text Descriptions
Guangwei Li
Xuenan Xu
Lingfeng Dai
Mengyue Wu
K. Yu
53
4
0
03 May 2023
Unsupervised Improvement of Audio-Text Cross-Modal Representations
Zhepei Wang
Cem Subakan
Krishna Subramani
Junkai Wu
Tiago Tavares
Fabio Ayres
Paris Smaragdis
SSL
27
3
0
03 May 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
VLM
34
104
0
17 Apr 2023
Graph Attention for Automated Audio Captioning
Feiyang Xiao
Jian Guan
Qiaoxi Zhu
Wenwu Wang
22
8
0
07 Apr 2023
Prefix tuning for automated audio captioning
Minkyu Kim
Kim Sung-Bin
Tae-Hyun Oh
21
43
0
30 Mar 2023
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Xinhao Mei
Chutong Meng
Haohe Liu
Qiuqiang Kong
Tom Ko
Chengqi Zhao
Mark D. Plumbley
Yuexian Zou
Wenwu Wang
55
196
0
30 Mar 2023
Fine-grained Audible Video Description
Xuyang Shen
Dong Li
Jinxing Zhou
Zhen Qin
Bowen He
...
Yuchao Dai
Lingpeng Kong
Meng Wang
Yu Qiao
Yiran Zhong
VGen
41
11
0
27 Mar 2023
Audio-Text Models Do Not Yet Leverage Natural Language
Ho-Hsiang Wu
Oriol Nieto
J. P. Bello
Justin Salamon
VLM
19
28
0
19 Mar 2023
BLAT: Bootstrapping Language-Audio Pre-training based on AudioSet Tag-guided Synthetic Data
Xuenan Xu
Zhiling Zhang
Zelin Zhou
Pingyue Zhang
Zeyu Xie
Mengyue Wu
Ke Zhu
CLIP
75
14
0
14 Mar 2023
Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss
Yifei Xin
Dongchao Yang
Yuexian Zou
44
31
0
10 Mar 2023
Exploring Efficient-Tuned Learning Audio Representation Method from BriVL
Sen Fang
Yang Wu
Bowen Gao
Jingwen Cai
T. Teoh
DiffM
29
1
0
08 Mar 2023
Training sound event detection with soft labels from crowdsourced annotations
Irene Martín-Morató
Manu Harju
Paul Ahokas
A. Mesaros
23
16
0
28 Feb 2023
Data leakage in cross-modal retrieval training: A case study
Benno Weck
Xavier Serra
31
7
0
23 Feb 2023
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Rongjie Huang
Jia-Bin Huang
Dongchao Yang
Yi Ren
Luping Liu
Mingze Li
Zhenhui Ye
Jinglin Liu
Xiaoyue Yin
Zhou Zhao
DiffM
151
318
0
30 Jan 2023
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
Haohe Liu
Zehua Chen
Yi Yuan
Xinhao Mei
Xubo Liu
Danilo Mandic
Wenwu Wang
Mark D. Plumbley
DiffM
49
473
0
29 Jan 2023
MAViL: Masked Audio-Video Learners
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
31
52
0
15 Dec 2022
Towards Generating Diverse Audio Captions via Adversarial Training
Xinhao Mei
Xubo Liu
Jianyuan Sun
Mark D. Plumbley
Wenwu Wang
DiffM
41
2
0
05 Dec 2022
Impact of visual assistance for automated audio captioning
Wim Boes
Hugo Van hamme
17
1
0
18 Nov 2022
Describing emotions with acoustic property prompts for speech emotion recognition
Hira Dhamyal
Benjamin Elizalde
Soham Deshmukh
Huaming Wang
Bhiksha Raj
Rita Singh
26
10
0
14 Nov 2022
Is my automatic audio captioning system so bad? SPIDEr-max: a metric to consider several caption candidates
Etienne Labbé
Thomas Pellegrini
J. Pinquier
14
4
0
14 Nov 2022
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Yusong Wu
K. Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
CLIP
39
490
0
12 Nov 2022
Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics
Sandeep Reddy Kothinti
Dimitra Emmanouilidou
14
3
0
12 Nov 2022
On Negative Sampling for Contrastive Audio-Text Retrieval
Huang Xie
Okko Rasanen
Tuomas Virtanen
33
7
0
08 Nov 2022
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Xubo Liu
Qiushi Huang
Xinhao Mei
Haohe Liu
Qiuqiang Kong
...
Yu Zhang
Lilian H. Y. Tang
Mark D. Plumbley
Volkan Kılıç
Wenwu Wang
50
18
0
28 Oct 2022
Automated Audio Captioning via Fusion of Low- and High- Dimensional Features
Jianyuan Sun
Xubo Liu
Xinhao Mei
Mark D. Plumbley
V. Kılıç
Wenwu Wang
33
3
0
10 Oct 2022
Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval
Benno Weck
Miguel Pérez Fernández
Holger Kirchhoff
Xavier Serra
15
3
0
06 Oct 2022
Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity
Swapnil Bhosale
Rupayan Chakraborty
Sunil Kumar Kopparapu
27
1
0
03 Oct 2022