CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

23 May 2025
Zhihao Du, Changfeng Gao, Yuxuan Wang, Fan Yu, Tianyu Zhao, Hao Wang, Xiang Lv, Hui Wang, Chongjia Ni, Xian Shi, Keyu An, Guanrou Yang, Yabin Li, Yanni Chen, Zhifu Gao, Qian Chen, Yue Gu, Mengzhe Chen, Yafeng Chen, Shiliang Zhang, Wen Wang, Jieping Ye
Abstract

In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at this https URL.
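Of the contributions listed above, two are concrete enough to sketch. First, the speech tokenizer is trained with supervised multi-task objectives (speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis). The abstract does not give the loss formulation, so the following is only a minimal sketch of how per-task heads and a weighted multi-task loss could sit on top of a shared speech encoder; all module names, dimensions, and loss weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Hypothetical multi-task supervision heads on a shared speech encoder.

    Sketch only: a shared encoder yields frame-level features; each
    supervision task (emotion, language ID, audio events, speaker) gets a
    classification head over pooled features. Sizes are assumptions.
    """
    def __init__(self, dim=512, n_emotions=8, n_langs=9, n_events=50, n_speakers=1000):
        super().__init__()
        self.heads = nn.ModuleDict({
            "emotion": nn.Linear(dim, n_emotions),
            "lang": nn.Linear(dim, n_langs),
            "event": nn.Linear(dim, n_events),
            "speaker": nn.Linear(dim, n_speakers),
        })

    def forward(self, feats):
        # feats: (batch, frames, dim); mean-pool to utterance level
        pooled = feats.mean(dim=1)
        return {task: head(pooled) for task, head in self.heads.items()}

def multitask_loss(outputs, targets, weights):
    """Weighted sum of per-task cross-entropy losses (frame-level ASR/CTC term omitted)."""
    ce = nn.CrossEntropyLoss()
    return sum(weights[t] * ce(outputs[t], targets[t]) for t in outputs)
```

Second, the differentiable reward model used for post-training is also not detailed in this abstract. The snippet below is a hedged sketch of one generic way a differentiable reward could be back-propagated into an LLM-based synthesizer, using a Gumbel-softmax relaxation over speech tokens; the `tts_lm` and `reward_model` interfaces are assumptions for illustration, not the paper's method.

```python
import torch.nn.functional as F

def reward_post_training_step(tts_lm, reward_model, text_batch, optimizer, tau=1.0):
    """One hypothetical post-training step driven by a differentiable reward.

    Assumed interfaces: tts_lm(text_batch) -> speech-token logits (B, T, V);
    reward_model(soft_tokens) -> scalar reward per utterance.
    """
    logits = tts_lm(text_batch)
    # Gumbel-softmax keeps token choices near-discrete while staying
    # differentiable, so reward gradients can reach the language model.
    soft_tokens = F.gumbel_softmax(logits, tau=tau, hard=False)
    loss = -reward_model(soft_tokens).mean()  # maximize the average reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```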

@article{du2025_2505.17589,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Zhihao Du and Changfeng Gao and Yuxuan Wang and Fan Yu and Tianyu Zhao and Hao Wang and Xiang Lv and Hui Wang and Chongjia Ni and Xian Shi and Keyu An and Guanrou Yang and Yabin Li and Yanni Chen and Zhifu Gao and Qian Chen and Yue Gu and Mengzhe Chen and Yafeng Chen and Shiliang Zhang and Wen Wang and Jieping Ye},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}