Behind Maya: Building a Multilingual Vision Language Model

13 May 2025

Abstract

Large Vision-Language Models (VLMs) have developed rapidly in recent years. They show impressive results on academic benchmarks, primarily in widely spoken languages, but underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at this https URL.

@article{alam2025_2505.08910,
  title={Behind Maya: Building a Multilingual Vision Language Model},
  author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth.S and Snehanshu Mukherjee and Alham Fikri Aji},
  journal={arXiv preprint arXiv:2505.08910},
  year={2025}
}
