Speaker consistency loss and step-wise optimization for semi-supervised
joint training of TTS and ASR using unpaired text data

Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data

11 July 2022

Naoki Makishima

Atsushi Ando

ArXiv (abs)PDF HTML

Papers citing "Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data"

13 / 13 papers shown

Title
Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 Zackary Rackauckas Julia Hirschberg 40 0 0 22 May 2025
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS Ye Jia Heiga Zen Jonathan Shen Yu Zhang Yonghui Wu SSL 96 84 0 28 Mar 2021
Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech C. Chien Jheng-hao Lin Chien-yu Huang Po-Chun Hsu Hung-yi Lee 105 70 0 06 Mar 2021
TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis Min-Jae Hwang Ryuichi Yamamoto Eunwoo Song Jae-Min Kim 44 32 0 26 Oct 2020
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech Yi Ren Chenxu Hu Xu Tan Tao Qin Sheng Zhao Zhou Zhao Tie-Yan Liu 105 1,411 0 08 Jun 2020
From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint Zexin Cai Chuxiong Zhang Ming Li 64 42 0 10 May 2020
Speech Recognition with Augmented Synthesized Speech Andrew Rosenberg Yu Zhang Bhuvana Ramabhadran Ye Jia Pedro J. Moreno Yonghui Wu Zelin Wu 67 128 0 25 Sep 2019
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech Heiga Zen Viet Dang R. Clark Yu Zhang Ron J. Weiss Ye Jia Zhiwen Chen Yonghui Wu 104 959 0 05 Apr 2019
Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis Yu-An Chung Yuxuan Wang Wei-Ning Hsu Yu Zhang RJ Skerry-Ryan 84 117 0 30 Aug 2018
VoxCeleb2: Deep Speaker Recognition Joon Son Chung Arsha Nagrani Andrew Zisserman 356 2,287 0 14 Jun 2018
Attentive Statistics Pooling for Deep Speaker Embedding K. Okabe Takafumi Koshinaka Koichi Shinoda 113 531 0 29 Mar 2018
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions Jonathan Shen Ruoming Pang Ron J. Weiss M. Schuster Navdeep Jaitly ... Yuxuan Wang RJ Skerry-Ryan Rif A. Saurous Yannis Agiomyrgiannakis Yonghui Wu 85 2,704 0 16 Dec 2017
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks Samy Bengio Oriol Vinyals Navdeep Jaitly Noam M. Shazeer 154 2,039 0 09 Jun 2015