Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
Alena Butryna
Shan-Hui Cathy Chu
Isin Demirsahin
Alexander Gutkin
Linne Ha
Fei He
Martin Jansche
Cibu Johny
Anna Katanova
Oddur Kjartansson
Chenfang Li
Tatiana Merkulova
Yin May Oo
Knot Pipatsrisawat
Clara E. Rivera
Supheakmungkol Sarin
Pasindu De Silva
Keshan Sanjaya Sodimana
R. Sproat
T. Wattanavekin
Jaka Aris Eko Wibawa

Abstract
This paper presents an overview of a program designed to address the growing need for freely available speech resources for under-represented languages. At present, we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe and South America. The paper describes the methodology used to develop these corpora and presents some of our findings that could benefit under-represented language communities.