Factorized RVQ-GAN For Disentangled Speech Tokenization

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
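The abstract does not spell out HAC's codebook sizes or training losses, but the residual vector quantization (RVQ) idea behind a factorized bottleneck can be illustrated with a toy, stdlib-only Python sketch: each quantizer level codes the residual left by the previous one, so successive levels (here standing in for acoustic, phonetic, and lexical tiers) refine the reconstruction. The codebooks and input below are made-up illustrative values, not the paper's.

```python
import math

def residual_vq(x, codebooks):
    """Quantize vector x with a stack of codebooks.

    Each level picks the codeword nearest to the residual left by the
    previous levels, mirroring how an RVQ bottleneck refines its
    approximation tier by tier. Returns the per-level code indices and
    the final quantized reconstruction.
    """
    residual = list(x)
    codes = []
    quantized = [0.0] * len(x)
    for cb in codebooks:
        # Nearest codeword (Euclidean) to the current residual.
        idx = min(range(len(cb)), key=lambda i: math.dist(cb[i], residual))
        codes.append(idx)
        # Accumulate the chosen codeword and update the residual.
        quantized = [q + c for q, c in zip(quantized, cb[idx])]
        residual = [xi - qi for xi, qi in zip(x, quantized)]
    return codes, quantized

# Three toy levels with two 2-D codewords each (hypothetical values).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],    # coarse level
    [[0.25, 0.0], [0.0, 0.25]],  # finer level
    [[0.0, 0.1], [0.1, 0.0]],    # finest level
]
codes, q = residual_vq([1.2, 0.1], codebooks)
# codes -> [0, 0, 0]; q -> [1.25, 0.1], an increasingly close
# approximation of the input [1.2, 0.1]
```

In HAC the analogous levels additionally receive distillation signals (HuBERT for the phonetic tier, LaBSE for the lexical tier), which this sketch omits.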
@article{khurana2025_2506.15456,
  title={Factorized RVQ-GAN For Disentangled Speech Tokenization},
  author={Sameer Khurana and Dominik Klement and Antoine Laurent and Dominik Bobos and Juraj Novosad and Peter Gazdik and Ellen Zhang and Zili Huang and Amir Hussein and Ricard Marxer and Yoshiki Masuyama and Ryo Aihara and Chiori Hori and Francois G. Germain and Gordon Wichern and Jonathan Le Roux},
  journal={arXiv preprint arXiv:2506.15456},
  year={2025}
}