Sylber: Syllabic Embedding Representation of Speech from Raw Audio

9 October 2024

Cheol Jun Cho

Nicholas Lee

Akshat Gupta

Dhruv Agarwal

Ethan Chen

Alan W Black

Gopala K. Anumanchipalli

Abstract

Syllables are compositional units of spoken language that play a crucial role in human speech perception and production. However, current neural speech representations lack such structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model, which is an exponential moving average of the model being trained. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding. We also train token-to-speech generative models with our syllabic units and show that fully intelligible speech can be reconstructed from these tokens. Lastly, we observe that categorical perception, a linguistic phenomenon of speech perception, emerges naturally in our model, making the embedding space more categorical and sparse than in previous self-supervised learning approaches. Together, we present a novel self-supervised approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.
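The teacher described in the abstract is an exponential moving average (EMA) of the model being trained. As a rough illustration of that idea only, the sketch below applies a generic EMA update to a dictionary of scalar parameters. The function name `ema_update`, the parameter layout, and the decay values are assumptions made for this example and are not taken from the paper.

```python
def ema_update(teacher, student, decay=0.999):
    """Generic EMA step: new_teacher = decay * teacher + (1 - decay) * student.

    `teacher` and `student` are dicts mapping parameter names to floats.
    The decay value is a placeholder, not a setting reported by the authors.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy usage: the teacher drifts toward a fixed student value over a few steps.
teacher = {"w": 0.0}
student = {"w": 1.0}
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
```

With a decay close to 1, the teacher changes slowly and so provides a smoother regression target than the student itself, which is the usual motivation for this kind of self-distillation.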

@article{cho2025_2410.07168,
  title={Sylber: Syllabic Embedding Representation of Speech from Raw Audio},
  author={Cheol Jun Cho and Nicholas Lee and Akshat Gupta and Dhruv Agarwal and Ethan Chen and Alan W Black and Gopala K. Anumanchipalli},
  journal={arXiv preprint arXiv:2410.07168},
  year={2025}
}