ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2402.05794
54
0

Phonetically rich corpus construction for a low-resourced language

8 February 2024
Marcellus Amadeus
William Alberto Cruz Castañeda
Wilmer Lobato
Niasche Aquino
ArXiv (abs)PDFHTML
Abstract

Speech technologies rely on capturing a speaker's voice variability while obtaining comprehensive language information. Textual prompts and sentence selection methods have been proposed in the literature to comprise such adequate phonetic data, referred to as a phonetically rich \textit{corpus}. However, they are still insufficient for acoustic modeling, especially critical for languages with limited resources. Hence, this paper proposes a novel approach and outlines the methodological aspects required to create a \textit{corpus} with broad phonetic coverage for a low-resourced language, Brazilian Portuguese. Our methodology includes text dataset collection up to a sentence selection algorithm based on triphone distribution. Furthermore, we propose a new phonemic classification according to acoustic-articulatory speech features since the absolute number of distinct triphones, or low-probability triphones, does not guarantee an adequate representation of every possible combination. Using our algorithm, we achieve a 55.8\% higher percentage of distinct triphones -- for samples of similar size -- while the currently available phonetic-rich corpus, CETUC and TTS-Portuguese, 12.6\% and 12.3\% in comparison to a non-phonetically rich dataset.

View on arXiv
Comments on this paper