T2MBench: A Benchmark for Out-of-Distribution Text-to-Motion Generation

14 February 2026

Bin Yang

Rong Ou

Weisheng Xu

Jiaqi Xiong

Xintao Li

Taowen Wang

Luyu Zhu

Xu Jiang

Jing Tan

Renjing Xu

Main: 8 pages; Bibliography: 2 pages; Appendix: 22 pages; 14 figures; 21 tables

Abstract

Most existing evaluations of text-to-motion generation focus on in-distribution textual inputs and a limited set of evaluation criteria, which limits their ability to systematically assess model generalization and motion generation under complex out-of-distribution (OOD) textual conditions. To address this limitation, we propose a benchmark designed specifically for OOD text-to-motion evaluation. It includes a comprehensive analysis of 14 representative baseline models and two datasets derived from the evaluation results. Specifically, we construct an OOD prompt dataset of 1,025 textual descriptions. Based on this prompt dataset, we introduce a unified evaluation framework that integrates LLM-based Evaluation, Multi-factor Motion Evaluation, and Fine-grained Accuracy Evaluation. Our experimental results show that, while different baseline models demonstrate strengths in areas such as text-to-motion semantic alignment, motion generalizability, and physical quality, most models struggle to achieve strong performance on the Fine-grained Accuracy Evaluation. These findings highlight the limitations of existing methods in OOD scenarios and offer practical guidance for the design and evaluation of future production-level text-to-motion models.
