arXiv:2307.16679

Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

31 July 2023
Guangyan Zhang
Thomas Merritt
M. Ribeiro
Biel Tura Vecino
K. Yanagisawa
Kamil Pokora
Abdelhamid Ezzerg
Sebastian Cygert
Ammar Abbas
Piotr Bilinski
Roberto Barra-Chicote
Daniel Korzekwa
Jaime Lorenzo-Trueba
Abstract

Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distribution of the target data. Aiming to relax those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction for text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors result in significant improvements over a typical L2-trained prosody model.
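
The remark about L1/L2 losses making strong distributional assumptions can be illustrated with a toy sketch. It is not the paper's setup: the target values, the constant predictor, and the grid search below are all hypothetical and chosen only for illustration. The sketch assumes one input that has been realized with two different but equally valid log-f0 values, and searches for the single prediction that minimizes each loss.

```python
import statistics

# Hypothetical toy targets: the same input realized with two distinct
# log-f0 values (illustrative numbers, not data from the paper).
targets = [4.6, 4.6, 4.6, 5.4, 5.4]

def l2_loss(pred, ys):
    """Mean squared error of a single constant prediction."""
    return sum((y - pred) ** 2 for y in ys) / len(ys)

def l1_loss(pred, ys):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(y - pred) for y in ys) / len(ys)

# Brute-force search over candidate constant predictions in [4.0, 6.0].
grid = [4.0 + 0.01 * i for i in range(201)]
best_l2 = min(grid, key=lambda p: l2_loss(p, targets))
best_l1 = min(grid, key=lambda p: l1_loss(p, targets))

# The L2 optimum sits near the mean, between the two modes, so it matches
# no observed target. The L1 optimum sits near the median, i.e. on one mode.
print("L2-optimal prediction:", round(best_l2, 2))
print("L1-optimal prediction:", round(best_l1, 2))
print("mean:", statistics.mean(targets), "median:", statistics.median(targets))
```

In this toy case a point estimate trained with either loss returns one value per input, whereas a generative model such as a flow or diffusion model can, in principle, sample from both modes.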
