Perceptually Guided End-to-End Text-to-Speech With MOS Prediction
Although recent end-to-end text-to-speech (TTS) systems produce high-quality speech, several factors can still degrade synthesis quality, such as a lack of training data or information loss during knowledge distillation. To address this problem, we propose a novel way to train a TTS model under the supervision of a perceptual loss, which measures the distance between the maximum speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train the TTS model to maximize the MOS of synthesized speech as predicted by the pre-trained MOS prediction model. Through this method, we can improve the quality of synthesized speech universally (i.e., regardless of the network architecture or the cause of the quality degradation) and efficiently (i.e., without increasing the inference time or the model complexity). Evaluation results for MOS and phoneme error rate demonstrate that our proposed approach improves upon previous models in terms of both naturalness and intelligibility.
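As a rough illustration of the training objective described in the abstract, the PyTorch-style sketch below combines a standard reconstruction loss with a perceptual loss defined as the gap between the maximum MOS and the score predicted by a frozen, pre-trained MOS model. The interfaces (`tts_model`, `mos_predictor`, `MAX_MOS`, the `alpha` weight) are hypothetical placeholders; the paper does not specify its exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_MOS = 5.0  # upper bound of the MOS scale (assumption)


def perceptual_loss(mos_predictor: nn.Module, mel_pred: torch.Tensor) -> torch.Tensor:
    """Distance between the maximum quality score and the predicted MOS."""
    predicted_mos = mos_predictor(mel_pred)  # assumed shape: (batch,)
    return (MAX_MOS - predicted_mos).mean()


def training_step(tts_model: nn.Module,
                  mos_predictor: nn.Module,
                  text: torch.Tensor,
                  mel_target: torch.Tensor,
                  alpha: float = 1.0) -> torch.Tensor:
    # Keep the pre-trained MOS predictor frozen; only the TTS model is updated.
    for p in mos_predictor.parameters():
        p.requires_grad_(False)

    mel_pred = tts_model(text)                       # synthesized mel spectrogram
    recon = F.l1_loss(mel_pred, mel_target)          # standard TTS reconstruction loss
    percep = perceptual_loss(mos_predictor, mel_pred)  # pushes predicted MOS upward
    return recon + alpha * percep
```

Because the perceptual term only adds a training-time objective, inference is unchanged, which is consistent with the abstract's claim that the improvement comes without extra inference cost or model complexity.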