VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation

30 May 2025

Main: 9 pages

7 figures

Bibliography: 1 page

6 tables

Appendix: 11 pages

Abstract

Machine translation systems fail when processing code-mixed inputs for low-resource languages. We address this challenge by curating VietMix, a parallel corpus of naturally occurring code-mixed Vietnamese text paired with expert English translations. To augment this resource, we developed a complementary synthetic data generation pipeline that incorporates filtering mechanisms to ensure syntactic plausibility and pragmatic appropriateness in code-mixing patterns. Experimental validation shows that our naturalistic and complementary synthetic data improve model performance, with translation quality estimation scores reaching up to 71.84 on COMETkiwi and 81.77 on XCOMET. Triangulating these positive results with LLM-based assessments, we find that augmented models are favored over seed fine-tuned counterparts in approximately 49% of judgments (54-56% excluding ties). VietMix and our augmentation methodology advance ecological validity in neural MT evaluations and establish a framework for addressing code-mixed translation challenges across other low-resource pairs.
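The two preference figures in the abstract differ only in the denominator: the overall rate counts ties, while the tie-excluded rate divides by decided judgments only. The following is a minimal sketch of that arithmetic. The function name `win_rates` and the counts in the usage lines are hypothetical, chosen purely for illustration, and are not taken from the paper.

```python
def win_rates(wins: int, ties: int, losses: int) -> tuple[float, float]:
    """Return (win rate over all judgments, win rate excluding ties)."""
    total = wins + ties + losses
    decided = wins + losses  # judgments where one model was preferred
    if total == 0 or decided == 0:
        raise ValueError("need at least one decided judgment")
    return wins / total, wins / decided


# Hypothetical counts for illustration only (not the paper's data).
overall, excluding_ties = win_rates(wins=49, ties=11, losses=40)
print(f"overall: {overall:.2%}, excluding ties: {excluding_ties:.2%}")
```

Under these illustrative counts, a model preferred in fewer than half of all judgments is still preferred in a majority of the judgments where the assessor expressed a preference.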


@article{tran2025_2505.24472,
  title={VietMix: A Naturally Occurring Vietnamese-English Code-Mixed Corpus with Iterative Augmentation for Machine Translation},
  author={Hieu Tran and Phuong-Anh Nguyen-Le and Huy Nghiem and Quang-Nhan Nguyen and Wei Ai and Marine Carpuat},
  journal={arXiv preprint arXiv:2505.24472},
  year={2025}
}
