MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

20 April 2026

Huakang Chen

Jingbin Hu

Liumeng Xue

Qirui Zhan

Wenhao Li

Guobin Ma

Hanke Xie

Dake Guo

Linhan Ma

Yuepeng Jiang

Bengu Wu

Pengyuan Xie

Chuan Xie

Qiang Zhang

Lei Xie

AuLLM

ELM

Main: 9 Pages

12 Figures

Bibliography: 2 Pages

17 Tables

Appendix: 12 Pages

Abstract

Instruction-following text-to-speech (TTS) has emerged as an important capability for controllable and expressive speech generation, yet its evaluation remains underdeveloped due to limited benchmark coverage, weak diagnostic granularity, and insufficient multilingual support. We present MINT-Bench, a comprehensive multilingual benchmark for instruction-following TTS. MINT-Bench is built upon a hierarchical multi-axis taxonomy, a scalable multi-stage data construction pipeline, and a hierarchical hybrid evaluation protocol that jointly assesses content consistency, instruction following, and perceptual quality. Experiments across ten languages show that the task remains far from solved: frontier commercial systems lead overall, while leading open-source models are highly competitive and can even outperform commercial counterparts in localized settings such as Chinese. The benchmark further reveals that harder compositional and paralinguistic controls remain major bottlenecks for current systems. We release MINT-Bench together with the data construction and evaluation toolkit to support future research on controllable, multilingual, and diagnostically grounded TTS evaluation. The leaderboard and demo are available at this https URL.
