ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2406.05370
  4. Cited By
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text
  to Speech Synthesizers
v1v2 (latest)

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

8 June 2024
Sanyuan Chen
Shujie Liu
Long Zhou
Yanqing Liu
Xu Tan
Jinyu Li
Sheng Zhao
Yao Qian
Furu Wei
    VLM
ArXiv (abs)PDFHTML

Papers citing "VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers"

31 / 31 papers shown
Title
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Zhihao Du
Changfeng Gao
Yuxuan Wang
Fan Yu
Tianyu Zhao
...
Mengzhe Chen
Yafeng Chen
Shiliang Zhang
Wen Wang
Jieping Ye
AuLLM
121
1
0
23 May 2025
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
Xiaohui Sun
Ruitong Xiao
Jianye Mo
Bowen Wu
Qun Yu
Baoxun Wang
89
3
0
03 Apr 2025
Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
Qianniu Chen
Xiaoyang Hao
Yangqiu Song
Yunxing Liu
Li Lu
71
0
0
15 Jan 2025
AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
Jinzuomu Zhong
Korin Richmond
Zhiba Su
Siqi Sun
116
6
0
10 Jan 2025
Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Bohan Li
Hankun Wang
Situo Zhang
Yiwei Guo
Kai Yu
100
9
0
29 Oct 2024
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Alan Dao
Dinh Bach Vu
Huy Hoang Ha
AuLLMVLM
127
5
0
20 Oct 2024
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
Xuyuan Li
Zengqiang Shang
Hua Hua
Peiyang Shi
Chen Yang
Li Wang
Pengyuan Zhang
134
3
0
16 Oct 2024
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen
Zhikang Niu
Ziyang Ma
Keqi Deng
Chunhui Wang
Jian Zhao
Kai Yu
Xie Chen
121
90
0
09 Oct 2024
SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection
SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection
Ismail Rasim Ulgen
Shreeram Suresh Chandra
Junchen Lu
Berrak Sisman
435
1
0
30 Aug 2024
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Shengpeng Ji
Ziyue Jiang
Xize Cheng
Yifu Chen
Minghui Fang
...
Rongjie Huang
Yidi Jiang
Qian Chen
Zhou Zhao
Zhou Zhao
VLM
129
44
0
29 Aug 2024
Autoregressive Speech Synthesis without Vector Quantization
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng
Long Zhou
Shujie Liu
Sanyuan Chen
Bing Han
...
Jinyu Li
Sheng Zhao
Xixin Wu
Helen M. Meng
Furu Wei
141
42
0
11 Jul 2024
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting
  for Text-to-Speech Synthesis
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
Detai Xin
Xu Tan
Kai Shen
Zeqian Ju
Dongchao Yang
...
Shinnosuke Takamichi
Hiroshi Saruwatari
Shujie Liu
Jinyu Li
Sheng Zhao
59
28
0
04 Apr 2024
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model
  on 100K hours of data
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Mateusz Lajszczak
Guillermo Cámbara
Yang Li
Fatih Beyhan
Arent van Korlaar
...
Bartosz Putrycz
Soledad López Gambino
Kayeon Yoo
Elena Sokolova
Thomas Drugman
LM&MA
89
88
0
12 Feb 2024
Audiobox: Unified Audio Generation with Natural Language Prompts
Audiobox: Unified Audio Generation with Natural Language Prompts
Apoorv Vyas
Bowen Shi
Matt Le
Andros Tjandra
Yi-Chiao Wu
...
Chris Summers
Carleigh Wood
Joshua Lane
Mary Williamson
Wei-Ning Hsu
113
94
0
25 Dec 2023
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Dongchao Yang
Jinchuan Tian
Xuejiao Tan
Rongjie Huang
Songxiang Liu
...
Jiang Bian
Xixin Wu
Zhou Zhao
Shinji Watanabe
Helen M. Meng
CVBMAuLLM
84
128
0
01 Oct 2023
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit
  for Neural Speech Codec
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec
Zhihao Du
Shiliang Zhang
Kai Hu
Siqi Zheng
60
63
0
14 Sep 2023
UniCATS: A Unified Context-Aware Text-to-Speech Framework with
  Contextual VQ-Diffusion and Vocoding
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding
Chenpeng Du
Yiwei Guo
Feiyu Shen
Zhijun Liu
Zheng Liang
Xie Chen
Shuai Wang
Hui Zhang
K. Yu
DiffM
69
44
0
13 Jun 2023
High-Fidelity Audio Compression with Improved RVQGAN
High-Fidelity Audio Compression with Improved RVQGAN
Rithesh Kumar
Prem Seetharaman
Alejandro Luebs
I. Kumar
Kundan Kumar
98
334
0
11 Jun 2023
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
  Bias
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Ziyue Jiang
Yi Ren
Zhe Ye
Jinglin Liu
Chen Zhang
...
Rongjie Huang
Chunfeng Wang
Xiang Yin
Zejun Ma
Zhou Zhao
DiffM
95
80
0
06 Jun 2023
PolyVoice: Language Models for Speech to Speech Translation
PolyVoice: Language Models for Speech to Speech Translation
Qianqian Dong
Zhiying Huang
Qiao Tian
Chen Xu
Tom Ko
...
Lu Lu
Zejun Ma
Yuping Wang
Mingxuan Wang
Yuxuan Wang
84
24
0
05 Jun 2023
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis,
  and Translation
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Tianrui Wang
Long Zhou
Zi-Hua Zhang
Yu-Huan Wu
Shujie Liu
Yashesh Gaur
Zhuo Chen
Jinyu Li
Furu Wei
81
105
0
25 May 2023
AudioLM: a Language Modeling Approach to Audio Generation
AudioLM: a Language Modeling Approach to Audio Generation
Zalan Borsos
Raphaël Marinier
Damien Vincent
Eugene Kharitonov
Olivier Pietquin
...
Dominik Roblek
O. Teboul
David Grangier
Marco Tagliasacchi
Neil Zeghidour
AuLLM
161
615
0
07 Sep 2022
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level
  Quality
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
Xu Tan
Jiawei Chen
Haohe Liu
Jian Cong
Chen Zhang
...
Lei He
Frank Soong
Tao Qin
Sheng Zhao
Tie-Yan Liu
109
221
0
09 May 2022
MaskGIT: Masked Generative Image Transformer
MaskGIT: Masked Generative Image Transformer
Huiwen Chang
Han Zhang
Lu Jiang
Ce Liu
William T. Freeman
ViT
153
695
0
08 Feb 2022
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice
  Conversion for everyone
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
Edresson Casanova
Julian Weber
C. Shulby
Arnaldo Cândido Júnior
Eren Golge
M. Ponti
234
414
0
04 Dec 2021
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
  Processing
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Sanyuan Chen
Chengyi Wang
Zhengyang Chen
Yu-Huan Wu
Shujie Liu
...
Yao Qian
Jian Wu
Micheal Zeng
Xiangzhan Yu
Furu Wei
SSL
265
1,898
0
26 Oct 2021
DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to
  evaluate Noise Suppressors
DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppressors
Chandan K. A. Reddy
Vishak Gopal
Ross Cutler
80
313
0
28 Oct 2020
Conformer: Convolution-augmented Transformer for Speech Recognition
Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati
James Qin
Chung-Cheng Chiu
Niki Parmar
Yu Zhang
...
Wei Han
Shibo Wang
Zhengdong Zhang
Yonghui Wu
Ruoming Pang
229
3,155
0
16 May 2020
Libri-Light: A Benchmark for ASR with Limited or No Supervision
Libri-Light: A Benchmark for ASR with Limited or No Supervision
Jacob Kahn
M. Rivière
Weiyi Zheng
Evgeny Kharitonov
Qiantong Xu
...
Tatiana Likhomanenko
Gabriel Synnaeve
Armand Joulin
Abdel-rahman Mohamed
Emmanuel Dupoux
AuLLM
75
673
0
17 Dec 2019
Sample Efficient Adaptive Text-to-Speech
Sample Efficient Adaptive Text-to-Speech
Yutian Chen
Yannis Assael
Brendan Shillingford
David Budden
Scott E. Reed
...
Ben Laurie
Çağlar Gülçehre
Aaron van den Oord
Oriol Vinyals
Nando de Freitas
88
149
0
27 Sep 2018
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram
  Predictions
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Jonathan Shen
Ruoming Pang
Ron J. Weiss
M. Schuster
Navdeep Jaitly
...
Yuxuan Wang
RJ Skerry-Ryan
Rif A. Saurous
Yannis Agiomyrgiannakis
Yonghui Wu
85
2,703
0
16 Dec 2017
1