ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2505.17589
  4. Cited By
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

23 May 2025
Zhihao Du
Changfeng Gao
Yuxuan Wang
Fan Yu
Tianyu Zhao
Hao Wang
Xiang Lv
Hui Wang
Xian Shi
Keyu An
Guanrou Yang
Yabin Li
Yabin Li
Yanni Chen
Zhifu Gao
Qian Chen
Yue Gu
Mengzhe Chen
Yafeng Chen
Shiliang Zhang
Wen Wang
Jieping Ye
    AuLLM
ArXivPDFHTML

Papers citing "CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training"

48 / 48 papers shown
Title
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
Xiaohui Sun
Ruitong Xiao
Jianye Mo
Bowen Wu
Qun Yu
Baoxun Wang
73
2
0
03 Apr 2025
Qwen2.5-Omni Technical Report
Qwen2.5-Omni Technical Report
Jin Xu
Zhifang Guo
Jinzheng He
Hangrui Hu
Ting He
...
K. Dang
Bin Zhang
Xinyu Wang
Yunfei Chu
Junyang Lin
VGen
AuLLM
116
31
0
26 Mar 2025
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Xiang Wang
Mingqi Jiang
Zejun Ma
Ziyu Zhang
Shixuan Liu
...
Zhifei Li
Xie Chen
Lei Xie
Yu Guo
Wei Xue
103
16
0
03 Mar 2025
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Wei Deng
Siyi Zhou
Jingchen Shu
Jinchao Wang
Lu Wang
VLM
66
4
0
08 Feb 2025
Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation
Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation
Zhengyan Sheng
Zhihao Du
Heng Lu
Shiliang Zhang
Zhen-Hua Ling
38
2
0
11 Jan 2025
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Qian Chen
Yafeng Chen
Yanni Chen
Mengzhe Chen
Yuxiao Chen
...
Shiliang Zhang
Nan Zhao
Pei Zhang
Chuxu Zhang
Jinren Zhou
AuLLM
MLLM
55
20
0
10 Jan 2025
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen
Zhikang Niu
Ziyang Ma
Keqi Deng
Chunhui Wang
Jian Zhao
Kai Yu
Xie Chen
78
73
0
09 Oct 2024
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
Hao-Han Guo
Kun Liu
Fei-Yu Shen
Yi-Chen Wu
Xu Tang
Kun Xie
Kai-Tuo Xu
Kun Xie
Kai-Tuo Xu
64
26
0
05 Sep 2024
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec
  Transformer
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang
Haoyue Zhan
Liwei Liu
Ruihong Zeng
Haotian Guo
Jiachen Zheng
Qiang Zhang
Shunsi Zhang
Shunsi Zhang
Zhizheng Wu
63
51
0
01 Sep 2024
Autoregressive Speech Synthesis without Vector Quantization
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng
Long Zhou
Shujie Liu
Sanyuan Chen
Bing Han
...
Jinyu Li
Sheng Zhao
Xixin Wu
Helen M. Meng
Furu Wei
91
40
0
11 Jul 2024
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer
  based on Supervised Semantic Tokens
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens
Zhihao Du
Qian Chen
Shiliang Zhang
Kai Hu
Heng Lu
...
Siqi Zheng
Yue Gu
Ziyang Ma
Zhifu Gao
Zhijie Yan
DiffM
35
124
0
07 Jul 2024
FunAudioLLM: Voice Understanding and Generation Foundation Models for
  Natural Interaction Between Humans and LLMs
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Keyu An
Qian Chen
Chong Deng
Zhihao Du
Changfeng Gao
...
Bin Zhang
Qinglin Zhang
Shiliang Zhang
Nan Zhao
Siqi Zheng
AuLLM
70
53
0
04 Jul 2024
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Sefik Emre Eskimez
Xiaofei Wang
Manthan Thakker
Canrun Li
Chung-Hsien Tsai
...
Min Tang
Xu Tan
Yanqing Liu
Sheng Zhao
Naoyuki Kanda
VLM
44
62
0
26 Jun 2024
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via
  Monotonic Alignment
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Bing Han
Long Zhou
Shujie Liu
Sanyuan Chen
Lingwei Meng
Yanming Qian
Yanqing Liu
Sheng Zhao
Jinyu Li
Furu Wei
81
21
0
12 Jun 2024
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and
  Benchmark
EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
Ziyang Ma
Mingjie Chen
Hezhao Zhang
Zhisheng Zheng
Wenxi Chen
Xiquan Li
Jiaxin Ye
Xie Chen
Thomas Hain
79
17
0
11 Jun 2024
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text
  to Speech Synthesizers
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Sanyuan Chen
Shujie Liu
Long Zhou
Yanqing Liu
Xu Tan
Jinyu Li
Sheng Zhao
Yao Qian
Furu Wei
VLM
61
76
0
08 Jun 2024
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Philip Anastassiou
Jiawei Chen
Jingshu Chen
Yuanzhe Chen
Zhuo Chen
...
Wenjie Zhang
Yanzhe Zhang
Zilin Zhao
Dejian Zhong
Xiaobin Zhuang
75
90
0
04 Jun 2024
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting
  for Text-to-Speech Synthesis
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
Detai Xin
Xu Tan
Kai Shen
Zeqian Ju
Dongchao Yang
...
Shinnosuke Takamichi
Hiroshi Saruwatari
Shujie Liu
Jinyu Li
Sheng Zhao
49
27
0
04 Apr 2024
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and
  Diffusion Models
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Zeqian Ju
Yuancheng Wang
Kai Shen
Xu Tan
Detai Xin
...
Shikun Zhang
Jiang Bian
Lei He
Jinyu Li
Sheng Zhao
DiffM
61
164
0
05 Mar 2024
VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech
Chenpeng Du
Yiwei Guo
Hankun Wang
Yifan Yang
Zhikang Niu
Shuai Wang
Hui Zhang
Xie Chen
Kai Yu
VLM
53
30
0
25 Jan 2024
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided
  Sequence Reordering
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering
Ya-Zhen Song
Zhuo Chen
Xiaofei Wang
Ziyang Ma
Xie Chen
AuLLM
89
42
0
14 Jan 2024
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for
  Enhanced Time-Domain Monaural Speech Separation
MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation
Shengkui Zhao
Yukun Ma
Chongjia Ni
Chong Zhang
Hao Wang
Trung Hieu Nguyen
Kun Zhou
J. Yip
Dianwen Ng
Bin Ma
55
25
0
19 Dec 2023
SECap: Speech Emotion Captioning with Large Language Model
SECap: Speech Emotion Captioning with Large Language Model
Yaoxun Xu
Hangting Chen
Jianwei Yu
Qiaochu Huang
Zhiyong Wu
Shixiong Zhang
Guangzhi Li
Yi Luo
Rongzhi Gu
43
27
0
16 Dec 2023
E3 TTS: Easy End-to-End Diffusion-based Text to Speech
E3 TTS: Easy End-to-End Diffusion-based Text to Speech
Yuan Gao
Nobuyuki Morioka
Yu Zhang
Nanxin Chen
DiffM
55
30
0
02 Nov 2023
Finite Scalar Quantization: VQ-VAE Made Simple
Finite Scalar Quantization: VQ-VAE Made Simple
Fabian Mentzer
David C. Minnen
E. Agustsson
Michael Tschannen
68
164
0
27 Sep 2023
VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
Yiwei Guo
Chenpeng Du
Ziyang Ma
Xie Chen
K. Yu
DiffM
50
42
0
10 Sep 2023
Matcha-TTS: A fast TTS architecture with conditional flow matching
Matcha-TTS: A fast TTS architecture with conditional flow matching
Shivam Mehta
Ruibo Tu
Jonas Beskow
Éva Székely
G. Henter
46
84
0
06 Sep 2023
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
  Resynthesis
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
Tu Nguyen
Wei-Ning Hsu
Antony DÁvirro
Bowen Shi
Itai Gat
...
Gabriel Synnaeve
Michael Hassid
Felix Kreuk
Yossi Adi
Emmanuel Dupoux
55
59
0
10 Aug 2023
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Matt Le
Apoorv Vyas
Bowen Shi
Brian Karrer
Leda Sari
...
Mary Williamson
Vimal Manohar
Yossi Adi
Jay Mahadeokar
Wei-Ning Hsu
AuLLM
71
290
0
23 Jun 2023
An Enhanced Res2Net with Local and Global Feature Fusion for Speaker
  Verification
An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification
Yafeng Chen
Siqi Zheng
Haibo Wang
Luyao Cheng
Qian Chen
Jiajun Qi
42
43
0
22 May 2023
FunASR: A Fundamental End-to-End Speech Recognition Toolkit
FunASR: A Fundamental End-to-End Speech Recognition Toolkit
Zhifu Gao
Zerui Li
Jiaming Wang
Haoneng Luo
Xian Shi
...
Yabin Li
Lingyun Zuo
Zhihao Du
Zhangyu Xiao
Shiliang Zhang
57
61
0
18 May 2023
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal
  Supervision
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
Eugene Kharitonov
Damien Vincent
Zalan Borsos
Raphaël Marinier
Sertan Girgin
Olivier Pietquin
Matthew Sharifi
Marco Tagliasacchi
Neil Zeghidour
58
198
0
07 Feb 2023
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang
Sanyuan Chen
Yu-Huan Wu
Zi-Hua Zhang
Long Zhou
...
Huaming Wang
Jinyu Li
Lei He
Sheng Zhao
Furu Wei
152
683
0
05 Jan 2023
Scalable Diffusion Models with Transformers
Scalable Diffusion Models with Transformers
William S. Peebles
Saining Xie
GNN
66
2,182
0
19 Dec 2022
Robust Speech Recognition via Large-Scale Weak Supervision
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford
Jong Wook Kim
Tao Xu
Greg Brockman
C. McLeavey
Ilya Sutskever
OffRL
111
3,515
0
06 Dec 2022
FLEURS: Few-shot Learning Evaluation of Universal Representations of
  Speech
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Alexis Conneau
Min Ma
Simran Khanuja
Yu Zhang
Vera Axelrod
Siddharth Dalmia
Jason Riesa
Clara E. Rivera
Ankur Bapna
VLM
102
306
0
25 May 2022
Large-scale Self-Supervised Speech Representation Learning for Automatic
  Speaker Verification
Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification
Zhengyang Chen
Sanyuan Chen
Yu-Huan Wu
Yao Qian
Chengyi Wang
Shujie Liu
Y. Qian
Michael Zeng
SSL
38
126
0
12 Oct 2021
DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric
  to Evaluate Noise Suppressors
DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors
Chandan K. A. Reddy
Vishak Gopal
Ross Cutler
64
213
0
05 Oct 2021
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling
  for Self-Supervised Speech Pre-Training
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Yu-An Chung
Yu Zhang
Wei Han
Chung-Cheng Chiu
James Qin
Ruoming Pang
Yonghui Wu
SSL
VLM
23
421
0
07 Aug 2021
SoundStream: An End-to-End Neural Audio Codec
SoundStream: An End-to-End Neural Audio Codec
Neil Zeghidour
Alejandro Luebs
Ahmed Omran
Jan Skoglund
Marco Tagliasacchi
AI4TS
71
760
0
07 Jul 2021
HuBERT: Self-Supervised Speech Representation Learning by Masked
  Prediction of Hidden Units
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Wei-Ning Hsu
Benjamin Bolte
Yao-Hung Hubert Tsai
Kushal Lakhotia
Ruslan Salakhutdinov
Abdel-rahman Mohamed
SSL
127
2,879
0
14 Jun 2021
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su
Yu Lu
Shengfeng Pan
Ahmed Murtadha
Bo Wen
Yunfeng Liu
139
2,307
0
20 Apr 2021
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
Yi Ren
Chenxu Hu
Xu Tan
Tao Qin
Sheng Zhao
Zhou Zhao
Tie-Yan Liu
90
1,382
0
08 Jun 2020
Common Voice: A Massively-Multilingual Speech Corpus
Common Voice: A Massively-Multilingual Speech Corpus
Rosana Ardila
Megan Branson
Kelly Davis
Michael Henretty
M. Kohler
Josh Meyer
Reuben Morais
Lindsay Saunders
Francis M. Tyers
Gregor Weber
VLM
61
1,571
0
13 Dec 2019
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
Ming-Yu Liu
Kainan Peng
Jitong Chen
39
344
0
19 Jul 2018
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram
  Predictions
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Jonathan Shen
Ruoming Pang
Ron J. Weiss
M. Schuster
Navdeep Jaitly
...
Yuxuan Wang
RJ Skerry-Ryan
Rif A. Saurous
Yannis Agiomyrgiannakis
Yonghui Wu
68
2,684
0
16 Dec 2017
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence
  Learning
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
Ming-Yu Liu
Kainan Peng
Andrew Gibiansky
Sercan O. Arik
Ajay Kannan
Sharan Narang
Jonathan Raiman
John Miller
57
304
0
20 Oct 2017
Tacotron: Towards End-to-End Speech Synthesis
Tacotron: Towards End-to-End Speech Synthesis
Yuxuan Wang
RJ Skerry-Ryan
Daisy Stanton
Yonghui Wu
Ron J. Weiss
...
Samy Bengio
Quoc V. Le
Yannis Agiomyrgiannakis
R. Clark
Rif A. Saurous
130
1,817
0
29 Mar 2017
1