Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Aashish Dhawan
Christopher Driggers-Ellis
Christan Grant
Daisy Zhe Wang
Main: 6 pages
Bibliography: 2 pages
3 tables
Abstract

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages.
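As a rough illustration of the evaluation metric named above, the following is a minimal, simplified sketch of a sentence-level chrF++-style score. It assumes character n-grams up to order 6 (whitespace removed), word n-grams up to order 2, and an F-score with beta = 2 averaged over the n-gram orders that are available. The function names `ngrams` and `chrf_pp` are hypothetical, and the abstract does not say which implementation the authors use; established toolkits differ in details such as smoothing, punctuation handling, and corpus-level aggregation, so this should not be read as their evaluation code.

```python
from collections import Counter


def ngrams(seq, n):
    """Count all contiguous n-grams in a sequence of characters or words."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def chrf_pp(hypothesis, reference, char_order=6, word_order=2, beta=2.0):
    """Simplified sentence-level chrF++-style score in the range [0, 100].

    Assumption: per-order F-beta scores are averaged uniformly, and orders
    longer than either sentence are skipped.
    """
    # Character n-grams ignore whitespace; word n-grams use whitespace tokens.
    hyp_chars = list(hypothesis.replace(" ", ""))
    ref_chars = list(reference.replace(" ", ""))
    hyp_words = hypothesis.split()
    ref_words = reference.split()

    units = [(hyp_chars, ref_chars, n) for n in range(1, char_order + 1)]
    units += [(hyp_words, ref_words, n) for n in range(1, word_order + 1)]

    scores = []
    for hyp, ref, n in units:
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        if not hyp_counts or not ref_counts:
            continue  # this n-gram order does not exist for one of the sentences
        # Clipped overlap: each n-gram counts at most as often as in both sides.
        overlap = sum((hyp_counts & ref_counts).values())
        precision = overlap / sum(hyp_counts.values())
        recall = overlap / sum(ref_counts.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            b2 = beta ** 2
            scores.append((1 + b2) * precision * recall / (b2 * precision + recall))

    return 100 * sum(scores) / len(scores) if scores else 0.0
```

Because most of the score comes from character n-grams, a hypothesis that gets a stem right but an affix wrong still earns partial credit, which is one reason character-level metrics are favored for morphologically rich languages.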
