User-Driven Voice Generation and Editing through Latent Space Navigation

30 August 2024

Yusheng Tian, Junbin Liu, Tan Lee

Main: 4 pages; Bibliography: 2 pages; 6 figures; 2 tables

Abstract

This paper presents a user-driven approach for synthesizing a specific target voice from user feedback rather than reference recordings. This is particularly beneficial for speech-impaired individuals who want to recreate their lost voices but have no prior recordings. Our method leverages the neural analysis and synthesis framework to construct a latent speaker embedding space. Within this latent space, a human-in-the-loop search algorithm guides the voice generation process: users complete a series of straightforward listening-and-comparison tasks, and their feedback iteratively refines the synthesized voice toward the desired target. Both computer simulations and real-world user studies demonstrate that the proposed approach can effectively approximate target voices. Moreover, by analyzing the Jacobians of the mel-spectrogram generator, we identify a set of meaningful voice editing directions within the latent space. These directions let users further fine-tune specific attributes of the generated voice, including pitch level, pitch range, volume, vocal tension, nasality, and tone color. Audio samples are available at https://myspeechprojects.github.io/voicedesign/.
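
The abstract does not spell out the search procedure, so the following is only a toy sketch of how a comparison-driven search over a latent vector could work, not the authors' implementation. It assumes a coordinate-wise search (suggested by the citation title), represents the speaker embedding as a plain list of floats, and replaces the human listener with a hypothetical `simulated_listener` that always prefers the candidate closer to a hidden target. The names `simulated_listener` and `coordinate_search`, the step sizes, and the acceptance rule are all illustrative assumptions; no speech is synthesized.

```python
def simulated_listener(target):
    """Hypothetical stand-in for a user: holds a hidden target vector
    and answers pairwise "which sounds closer?" questions."""
    def prefers_first(a, b):
        dist_a = sum((x - t) ** 2 for x, t in zip(a, target))
        dist_b = sum((x - t) ** 2 for x, t in zip(b, target))
        return dist_a <= dist_b  # ties go to the first candidate
    return prefers_first


def coordinate_search(start, prefers_first, step=1.0, shrink=0.5, sweeps=20):
    """Coordinate-wise search that uses only pairwise comparisons.

    In each sweep, every coordinate is nudged by +step and then -step;
    a nudge is kept only if the listener strictly prefers the nudged
    vector over the current one. The step is shrunk whenever a full
    sweep produces no accepted move.
    """
    current = list(start)
    for _ in range(sweeps):
        moved = False
        for i in range(len(current)):
            for delta in (step, -step):
                candidate = list(current)
                candidate[i] += delta
                if not prefers_first(current, candidate):
                    current = candidate
                    moved = True
                    break  # accept the first improving direction
        if not moved:
            step *= shrink  # refine the search resolution
    return current


if __name__ == "__main__":
    hidden_target = [0.3, -1.2, 2.0, 0.7]  # made-up "desired voice"
    listener = simulated_listener(hidden_target)
    found = coordinate_search([0.0] * 4, listener, sweeps=60)
    print([round(v, 3) for v in found])
```

In a real system each comparison would involve synthesizing two voices and asking the user to pick one, so the number of comparisons per sweep (at most two per coordinate here) is the cost that matters most.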

@article{tian2025_2408.17068,
  title={Personalized Voice Synthesis through Human-in-the-Loop Coordinate Descent},
  author={Yusheng Tian and Junbin Liu and Tan Lee},
  journal={arXiv preprint arXiv:2408.17068},
  year={2025}
}
