
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

Main: 1 page · Bibliography: 2 pages · Appendix: 14 pages · 8 figures · 8 tables
Abstract

Cultural content poses challenges for machine translation systems due to differences in conceptualization between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can serve as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples, each pairing an image with parallel captions in English and a regional language. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender usage. By releasing CaMMT, we aim to support broader efforts in building and evaluating multimodal translation systems that are better aligned with cultural nuance and regional variation.

@article{villa-cueva2025_2505.24456,
  title={CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation},
  author={Emilio Villa-Cueva and Sholpan Bolatzhanova and Diana Turmakhan and Kareem Elzeky and Henok Biadglign Ademtew and Alham Fikri Aji and Israel Abebe Azime and Jinheon Baek and Frederico Belcavello and Fermin Cristobal and Jan Christian Blaise Cruz and Mary Dabre and Raj Dabre and Toqeer Ehsan and Naome A Etori and Fauzan Farooqui and Jiahui Geng and Guido Ivetta and Thanmay Jayakumar and Soyeong Jeong and Zheng Wei Lim and Aishik Mandal and Sofia Martinelli and Mihail Minkov Mihaylov and Daniil Orel and Aniket Pramanick and Sukannya Purkayastha and Israfel Salazar and Haiyue Song and Tiago Timponi Torrent and Debela Desalegn Yadeta and Injy Hamed and Atnafu Lambebo Tonja and Thamar Solorio},
  journal={arXiv preprint arXiv:2505.24456},
  year={2025}
}