RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval

26 May 2025

Abstract

The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance on general audio description tasks such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retrieval (ESSR), together with ESS-CLAP, an emotional speaking style CLAP model tailored for learning the relationship between speech and natural language descriptions. We further propose relation-augmented CLAP (RA-CLAP) to address a limitation of traditional methods, which assume a strict binary relationship between captions and audio. The model leverages self-distillation to learn the potential local matching relationships between speech and descriptions, thereby enhancing its generalization ability. The experimental results validate the effectiveness of RA-CLAP, providing a valuable reference for ESSD research.
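To make the contrast between a strict binary caption-audio relationship and a relaxed one concrete, the following is a minimal NumPy sketch, not the paper's implementation. It computes a generic symmetric contrastive loss over a batch of paired embeddings, first with one-hot (binary) targets and then with soft targets obtained by mixing the one-hot matrix with a teacher's softmax over similarities. The function names, the mixing weight `alpha`, the temperature value, and the way the teacher distribution is formed are all assumptions made for illustration; the abstract does not specify how RA-CLAP builds its relation targets.

```python
import numpy as np

def log_softmax(x, axis=1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def similarity_logits(audio_emb, text_emb, temperature=0.07):
    # Cosine similarity between every speech clip and every description,
    # scaled by a temperature (0.07 is an assumed, commonly used value).
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return a @ t.T / temperature

def soft_cross_entropy(logits, targets):
    # Cross-entropy against row-stochastic targets (one-hot or soft).
    return -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()

def contrastive_loss(logits, targets_a2t=None, targets_t2a=None):
    # Symmetric loss; identity targets encode the strict binary
    # assumption that only the i-th caption matches the i-th clip.
    n = logits.shape[0]
    if targets_a2t is None:
        targets_a2t = np.eye(n)
    if targets_t2a is None:
        targets_t2a = np.eye(n)
    loss_a2t = soft_cross_entropy(logits, targets_a2t)
    loss_t2a = soft_cross_entropy(logits.T, targets_t2a)
    return 0.5 * (loss_a2t + loss_t2a)

def relation_targets(teacher_logits, alpha=0.3):
    # Hypothetical soft targets: blend the one-hot matrix with the
    # teacher's softmax so partially matching pairs get some weight.
    n = teacher_logits.shape[0]
    teacher_probs = np.exp(log_softmax(teacher_logits, axis=1))
    return (1.0 - alpha) * np.eye(n) + alpha * teacher_probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(4, 8))                # stand-in speech embeddings
    text = audio + 0.1 * rng.normal(size=(4, 8))   # stand-in caption embeddings
    logits = similarity_logits(audio, text)
    hard = contrastive_loss(logits)
    soft = contrastive_loss(
        logits,
        relation_targets(logits, alpha=0.3),
        relation_targets(logits.T, alpha=0.3),
    )
    print("binary-target loss:", hard)
    print("soft-target loss:", soft)
```

In this sketch the teacher logits are simply the model's own similarities, which is one simple way to mimic self-distillation; in practice the teacher signal would typically come from a separate or earlier copy of the model and be treated as a constant during backpropagation.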

@article{sun2025_2505.19437,
  title={RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval},
  author={Haoqin Sun and Jingguang Tian and Jiaming Zhou and Hui Wang and Jiabei He and Shiwan Zhao and Xiangyu Kong and Desheng Hu and Xinkang Xu and Xinhui Hu and Yong Qin},
  journal={arXiv preprint arXiv:2505.19437},
  year={2025}
}
