
Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri
Michael Hinczewski
Jing Ma
Vipin Chaudhary
Main: 9 pages · 7 figures · 24 tables · Bibliography: 4 pages · Appendix: 28 pages
Abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across 20 reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$–$0.95$), and 19–34 methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by 16–52%, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at this https URL.
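The agreement metric above, Kendall's $\tau_b$, compares two rankings while correcting for ties. The sketch below illustrates the comparison on synthetic data; it is not Scorio's actual API, and the model skills and trial counts are made-up placeholders chosen to mirror the paper's setup (20 models, up to N=80 trials).

```python
# Minimal sketch (assumed setup, not Scorio's API): rank models by success
# rate under test-time scaling and measure ranking agreement with tau-b.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models, n_trials = 20, 80  # mirrors the paper's 20 models, N=80 trials

# Hypothetical per-trial pass/fail outcomes for each model on one benchmark.
true_skill = rng.uniform(0.1, 0.9, size=n_models)
outcomes = rng.random((n_models, n_trials)) < true_skill[:, None]

# Ranking 1: mean accuracy over all 80 trials (full-trial regime).
score_full = outcomes.mean(axis=1)

# Ranking 2: accuracy from a single trial (low-budget N=1 regime).
score_single = outcomes[:, 0].astype(float)

# scipy's kendalltau defaults to the tau-b variant, which handles ties.
tau_b, p_value = kendalltau(score_full, score_single)
print(f"Kendall's tau-b, full-trial vs. single-trial ranking: {tau_b:.3f}")
```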
