Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion

4 June 2025
Seymanur Akti
Tuan Nam Nguyen
Alexander Waibel
Main: 4 pages, 1 figure, 4 tables; bibliography: 1 page
Abstract

Expressive voice conversion aims to transfer both speaker identity and expressive attributes from a target speech to a given source speech. In this work, we improve over a self-supervised, non-autoregressive framework with a conditional variational autoencoder, focusing on reducing source timbre leakage and improving linguistic-acoustic disentanglement for better style transfer. To minimize style leakage, we use multilingual discrete speech units for content representation and reinforce embeddings with augmentation-based similarity loss and mix-style layer normalization. To enhance expressivity transfer, we incorporate local F0 information via cross-attention and extract style embeddings enriched with global pitch and energy features. Experiments show our model outperforms baselines in emotion and speaker similarity, demonstrating superior style adaptation and reduced source style leakage.
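The abstract mentions mix-style layer normalization as one of the tools for reducing source style leakage. As a rough illustration of that family of techniques (not the authors' exact formulation, whose details are in the paper), the sketch below normalizes each utterance's feature statistics and re-scales them with statistics mixed from a randomly permuted batch partner; the function name, axis layout, and `alpha` parameter are assumptions for this example.

```python
import numpy as np

def mix_style_layer_norm(x, alpha=0.1, rng=None):
    """Illustrative mix-style normalization over a batch of features.

    x: array of shape (batch, time, feature_dim).
    Normalizes each utterance with its own mean/std, then re-scales
    with statistics interpolated toward a shuffled batch member,
    which perturbs instance-level style statistics during training.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Per-utterance statistics over the time and feature axes.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True) + 1e-6
    x_norm = (x - mu) / sigma
    # Mix statistics with a randomly chosen batch partner.
    perm = rng.permutation(x.shape[0])
    lam = rng.beta(alpha, alpha, size=(x.shape[0], 1, 1))
    mu_mix = lam * mu + (1.0 - lam) * mu[perm]
    sigma_mix = lam * sigma + (1.0 - lam) * sigma[perm]
    return x_norm * sigma_mix + mu_mix
```

Because the output carries mixed style statistics while the normalized content is unchanged, downstream layers are discouraged from relying on utterance-level timbre cues, which is the disentanglement effect the abstract describes.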

@article{akti2025_2506.04013,
  title={Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion},
  author={Seymanur Akti and Tuan Nam Nguyen and Alexander Waibel},
  journal={arXiv preprint arXiv:2506.04013},
  year={2025}
}