
Sample Complexity and Representation Ability of Test-time Scaling Paradigms

Main: 12 pages · Appendix: 21 pages · Bibliography: 8 pages · 6 figures · 2 tables
Abstract

Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies -- such as self-consistency, best-of-$n$, and self-correction -- remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.
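The separation result is easy to probe numerically. Below is a minimal Monte Carlo sketch (illustrative, not the paper's experimental setup) that compares self-consistency (majority vote) against best-of-$n$ under an assumed oracle verifier, on a toy answer distribution where the correct answer beats the runner-up by a gap $\Delta$; the distribution, verifier, and parameters are all assumptions. For $\Delta = 0.05$, the abstract's rates suggest on the order of $1/\Delta^2 = 400$ samples for self-consistency versus $1/\Delta = 20$ for best-of-$n$.

```python
# Minimal Monte Carlo sketch (illustrative, not from the paper): compare
# self-consistency (majority vote) with best-of-n under an oracle verifier.
import random
from collections import Counter

def sample_answer(delta):
    """Toy answer distribution: correct answer 'A' has probability
    0.3 + delta, runner-up 'B' has 0.3, the rest is split over distractors."""
    r = random.random()
    if r < 0.3 + delta:
        return "A"
    if r < 0.6 + delta:
        return "B"
    return random.choice(["C", "D", "E"])

def self_consistency(n, delta):
    """Majority vote over n samples; correct iff 'A' wins the vote."""
    votes = Counter(sample_answer(delta) for _ in range(n))
    return votes.most_common(1)[0][0] == "A"

def best_of_n(n, delta):
    """Best-of-n with an assumed perfect verifier: correct iff 'A'
    appears at least once among the n samples."""
    return any(sample_answer(delta) == "A" for _ in range(n))

def success_rate(strategy, n, delta, trials=2000):
    return sum(strategy(n, delta) for _ in range(trials)) / trials

if __name__ == "__main__":
    delta = 0.05  # probability gap between 'A' and 'B'
    for n in (20, 100, 400):
        print(f"n={n:4d}  self-consistency={success_rate(self_consistency, n, delta):.3f}  "
              f"best-of-n={success_rate(best_of_n, n, delta):.3f}")
```

The second result, that self-correction with verifier feedback lets a Transformer emulate online learning over a pool of experts, can likewise be pictured with the classical Hedge (multiplicative weights) update. The sketch below is a stand-in for that online-learning primitive, not the paper's Transformer construction; the expert pool, verifier, and learning rate are illustrative assumptions.

```python
# Illustrative Hedge update over a pool of experts, driven by 0/1 verifier
# feedback; the experts, verifier, and eta are assumptions for illustration.
import math

def hedge(experts, verifier, queries, eta=0.5):
    """Full-information Hedge: each round, answer with the highest-weight
    expert, then downweight every expert whose answer the verifier rejects."""
    weights = [1.0] * len(experts)
    for q in queries:
        best = max(range(len(experts)), key=lambda j: weights[j])
        _prediction = experts[best](q)  # the answer actually returned
        losses = [0.0 if verifier(q, f(q)) else 1.0 for f in experts]
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return weights

# Example: two candidate "task solvers"; the verifier only accepts doubling,
# so the weight of the doubling expert dominates after a few rounds.
experts = [lambda q: q + 1, lambda q: 2 * q]
final = hedge(experts, lambda q, a: a == 2 * q, queries=range(10))
```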

@article{huang2025_2506.05295,
  title={Sample Complexity and Representation Ability of Test-time Scaling Paradigms},
  author={Baihe Huang and Shanda Li and Tianhao Wu and Yiming Yang and Ameet Talwalkar and Kannan Ramchandran and Michael I. Jordan and Jiantao Jiao},
  journal={arXiv preprint arXiv:2506.05295},
  year={2025}
}