
ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning

Main: 8 pages · 7 figures · 7 tables · Bibliography: 4 pages · Appendix: 5 pages
Abstract

Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to assess LLMs' reasoning capabilities robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset containing 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most LLMs' performance is far from robust and that they face a certain degree of data leakage. By dynamically generating OOD datasets, ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination.
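The abstract does not detail how the dynamic OOD data generation works. Purely as an illustration of the general idea (not the authors' actual method), the minimal Python sketch below shows one common way to produce fresh OOD variants of a reasoning problem: perturbing surface entities and numeric values from a template while keeping the underlying reasoning, and hence the gold-answer function, fixed. All names and the template here are hypothetical.

```python
import random

# Hypothetical illustration only: generate an OOD variant of a math word
# problem by resampling names and numbers while keeping the reasoning
# structure (and therefore the answer function) intact. This is NOT the
# ThinkBench procedure; the abstract does not specify one.

NAMES = ["Avery", "Bilal", "Chen", "Dara"]

def make_variant(template: str, answer_fn, rng: random.Random):
    """Instantiate a problem template with fresh entities and values.

    template:  a question with {name}, {a}, {b} placeholders.
    answer_fn: computes the gold answer from the sampled values.
    """
    name = rng.choice(NAMES)
    a, b = rng.randint(50, 99), rng.randint(2, 9)
    question = template.format(name=name, a=a, b=b)
    return question, answer_fn(a, b)

if __name__ == "__main__":
    rng = random.Random(0)
    template = ("{name} has {a} apples and gives away {b} baskets of "
                "5 apples each. How many apples remain?")
    question, gold = make_variant(template, lambda a, b: a - 5 * b, rng)
    print(question)  # surface form changes on every draw
    print(gold)      # gold answer stays consistent with the sampled values
```

Because each evaluation run can draw new instances, a sketch like this reduces the chance that a model has memorized the exact test items, which is the kind of contamination-resistant evaluation the abstract describes.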

@article{huang2025_2502.16268,
  title={ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning},
  author={Shulin Huang and Linyi Yang and Yan Song and Shuang Chen and Leyang Cui and Ziyu Wan and Qingcheng Zeng and Ying Wen and Kun Shao and Weinan Zhang and Jun Wang and Yue Zhang},
  journal={arXiv preprint arXiv:2502.16268},
  year={2025}
}