Discrete Audio Representations for Automated Audio Captioning

Discrete audio representations, termed audio tokens, are broadly categorized into semantic and acoustic tokens, typically generated through unsupervised tokenization of continuous audio representations. However, their applicability to automated audio captioning (AAC) remains underexplored. This paper systematically investigates the viability of audio token-driven models for AAC through comparative analyses of various tokenization methods. Our findings reveal that audio tokenization leads to performance degradation in AAC models compared to those that directly utilize continuous audio representations. To address this issue, we introduce a supervised audio tokenizer trained with an audio tagging objective. Unlike unsupervised tokenizers, which lack explicit semantic understanding, the proposed tokenizer effectively captures audio event information. Experiments conducted on the Clotho dataset demonstrate that the proposed audio tokens outperform conventional audio tokens in the AAC task.
@article{tian2025_2505.14989,
  title={Discrete Audio Representations for Automated Audio Captioning},
  author={Jingguang Tian and Haoqin Sun and Xinhui Hu and Xinkang Xu},
  journal={arXiv preprint arXiv:2505.14989},
  year={2025}
}