
Improving Methodologies for LLM Evaluations Across Global Languages

Akriti Vij
Benjamin Chua
Darshini Ramiah
En Qi Ng
Mahran Morsidi
Naga Nikshith Gangarapu
Sharmini Johnson
Vanessa Wilfred
Vikneswaran Kumaran
Wan Sie Lee
Wenzhuo Yang
Yongsen Zheng
Bill Black
Boming Xia
Frank Sun
Hao Zhang
Qinghua Lu
Suyu Ma
Yue Liu
Chi-kiu Lo
Fatemeh Azadi
Isar Nejadgholi
Sowmya Vajjala
Agnes Delaborde
Nicolas Rolin
Tom Seimandi
Akiko Murakami
Haruto Ishi
Satoshi Sekine
Takayuki Semitsu
Tasuku Sasaki
Angela Kinuthia
Jean Wangari
Michael Michie
Stephanie Kasaon
Hankyul Baek
Jaewon Noh
Kihyuk Nam
Sang Seo
Sungpil Shin
Taewhi Lee
Yongsu Kim
Daisy Newbold-Harrop
Jessica Wang
Mahmoud Ghanem
Vy Hong
Main: 69 pages, 9 figures, 9 tables
Abstract

As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK, conducted a joint multilingual evaluation exercise. Led by Singapore AISI, the exercise tested two open-weight models across ten languages spanning high- and low-resource groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages, including differences in safeguard robustness across languages and harm types, and variation in evaluator reliability (LLM-as-a-judge vs. human review). It also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.
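As a rough illustration of the kind of LLM-as-a-judge pass described above, the sketch below shows one way verdicts on translated safety prompts could be aggregated per language and harm category. It is an assumption-laden sketch, not the paper's actual pipeline: the names (Prompt, judge_model, evaluate) and the SAFE/UNSAFE labels are introduced here for illustration only, and the real exercise also relied on human annotation and detailed rubrics.

```python
# Minimal, hypothetical sketch of an LLM-as-a-judge safety evaluation loop.
# None of these names come from the paper; the judge is stubbed out so the
# example is self-contained and runnable.
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class Prompt:
    text: str
    language: str        # e.g. "Kiswahili", "Telugu"
    harm_category: str   # e.g. "privacy", "violent crime", "jailbreak robustness"


def judge_model(prompt_text: str, model_response: str) -> str:
    """Placeholder judge: in practice this would call an LLM with a scoring
    rubric and parse its verdict. Here it simply treats refusals as SAFE."""
    return "SAFE" if "cannot help" in model_response.lower() else "UNSAFE"


def evaluate(prompts: list[Prompt], generate) -> dict:
    """Run the model under test on each prompt, judge the response, and
    aggregate the share of SAFE verdicts per (language, harm_category) cell."""
    cells = defaultdict(lambda: {"safe": 0, "total": 0})
    for p in prompts:
        response = generate(p.text)               # model under test
        verdict = judge_model(p.text, response)   # LLM-as-a-judge step
        cell = cells[(p.language, p.harm_category)]
        cell["total"] += 1
        cell["safe"] += int(verdict == "SAFE")
    return {key: c["safe"] / c["total"] for key, c in cells.items()}


if __name__ == "__main__":
    # Dummy model that always refuses, just to show the aggregation output.
    prompts = [
        Prompt("...", "Kiswahili", "privacy"),
        Prompt("...", "French", "jailbreak robustness"),
    ]
    rates = evaluate(prompts, generate=lambda text: "Sorry, I cannot help with that.")
    print(rates)
```

In a real multilingual evaluation, the judge prompt itself would need stress-testing and the per-cell rates would be cross-checked against human annotation, which is one of the methodological points the abstract raises.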
