Reconstructing Syllable Sequences in Abugida Scripts with Incomplete Inputs

This paper explores syllable sequence prediction in Abugida languages using Transformer-based models, focusing on six languages from the Asian Language Treebank (ALT) dataset: Bengali, Hindi, Khmer, Lao, Myanmar, and Thai. We investigate the reconstruction of complete syllable sequences from several types of incomplete input, including consonant sequences, vowel sequences, partial syllables (with random character deletions), and masked syllables (with fixed syllable deletions). Our experiments show that consonant sequences play a critical role in accurate syllable prediction, achieving high BLEU scores, whereas reconstruction from vowel sequences is significantly more challenging. The model performs robustly across tasks, particularly on partial and masked syllable reconstruction. This study advances the understanding of sequence prediction for Abugida languages and provides practical insights for applications such as text prediction, spelling correction, and data augmentation in these scripts.
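The incomplete input types named above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing: the function names, the `<mask>` token, the deletion probability `p`, and the "every k-th syllable" masking rule are all hypothetical, and the consonant/vowel split relies on the simplifying assumption that dependent vowel signs are exactly the characters in Unicode category `M*`. That holds for the Devanagari example below but not in general (several Thai vowel signs are category `Lo`, and signs such as the virama are marks without being vowels).

```python
import random
import unicodedata

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def is_mark(ch):
    # Assumption: combining marks (category Mn/Mc/Me) stand in for vowel signs.
    return unicodedata.category(ch).startswith("M")


def consonant_only(syllables):
    # Keep base letters, drop combining marks.
    return ["".join(c for c in s if not is_mark(c)) for s in syllables]


def marks_only(syllables):
    # Keep combining marks only (a rough proxy for a "vowel sequence").
    return ["".join(c for c in s if is_mark(c)) for s in syllables]


def partial(syllables, rng, p=0.3):
    # Delete each character with probability p, keeping at least one
    # character of every non-empty syllable.
    out = []
    for s in syllables:
        kept = "".join(c for c in s if rng.random() >= p)
        out.append(kept if kept else s[:1])
    return out


def masked(syllables, every=3):
    # Replace every `every`-th syllable with the mask token (assumed rule).
    return [MASK if (i + 1) % every == 0 else s
            for i, s in enumerate(syllables)]


# Hindi "kitaab" pre-segmented into three syllables (segmentation assumed).
syllables = ["\u0915\u093f", "\u0924\u093e", "\u092c"]
rng = random.Random(0)  # seeded so the random deletions are repeatable

inputs = {
    "consonant": consonant_only(syllables),
    "vowel": marks_only(syllables),
    "partial": partial(syllables, rng),
    "masked": masked(syllables, every=2),
}
```

Each entry of `inputs` would serve as a source sequence, with the original `syllables` list as the target the model is asked to reconstruct.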
@article{thu2025_2505.11008,
  title={Reconstructing Syllable Sequences in Abugida Scripts with Incomplete Inputs},
  author={Ye Kyaw Thu and Thazin Myint Oo},
  journal={arXiv preprint arXiv:2505.11008},
  year={2025}
}