
Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri
Michael Hinczewski
Jing Ma
Vipin Chaudhary
Main: 9 pages · 7 figures · 24 tables · Bibliography: 4 pages · Appendix: 28 pages
Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across 20 reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$–$0.95$), and 19–34 methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at this https URL.
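The agreement metric above, Kendall's $\tau_b$, compares two rankings while correcting for ties. The sketch below illustrates the comparison on synthetic data; it is not Scorio's actual API, and the model skills and trial counts are made-up placeholders chosen to mirror the paper's setup (20 models, up to N=80 trials).

```python
# Minimal sketch (assumed setup, not Scorio's API): rank models by success
# rate under test-time scaling and measure ranking agreement with tau-b.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models, n_trials = 20, 80  # mirrors the paper's 20 models, N=80 trials

# Hypothetical per-trial pass/fail outcomes for each model on one benchmark.
true_skill = rng.uniform(0.1, 0.9, size=n_models)
outcomes = rng.random((n_models, n_trials)) < true_skill[:, None]

# Ranking 1: mean accuracy over all 80 trials (full-trial regime).
score_full = outcomes.mean(axis=1)

# Ranking 2: accuracy from a single trial (low-budget N=1 regime).
score_single = outcomes[:, 0].astype(float)

# scipy's kendalltau defaults to the tau-b variant, which handles ties.
tau_b, p_value = kendalltau(score_full, score_single)
print(f"Kendall's tau-b, full-trial vs. single-trial ranking: {tau_b:.3f}")
```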
