Value-Guided Search for Efficient Chain-of-Thought Reasoning

23 May 2025
Kaiwen Wang
Jin Peng Zhou
Jonathan D. Chang
Zhaolin Gao
Nathan Kallus
Kianté Brantley
Wen Sun
Author Contacts: kw437@cornell.edu, jz563@cornell.edu
Main: 9 pages · 20 figures · 6 tables · Bibliography: 4 pages · Appendix: 16 pages
Abstract

In this paper, we propose a simple and efficient method for value model training on long-context reasoning traces. Compared to existing process reward models (PRMs), our method does not require a fine-grained notion of "step," which is difficult to define for long-context reasoning models. By collecting a dataset of 2.5 million reasoning traces, we train a 1.5B token-level value model and apply it to DeepSeek models for improved performance with test-time compute scaling. We find that block-wise value-guided search (VGS) with a final weighted majority vote achieves better test-time scaling than standard methods such as majority voting or best-of-n. With an inference budget of 64 generations, VGS with DeepSeek-R1-Distill-1.5B achieves an average accuracy of 45.7% across four competition math benchmarks (AIME 2024 & 2025, HMMT Feb 2024 & 2025), reaching parity with o3-mini-medium. Moreover, VGS significantly reduces the inference FLOPs required to achieve the same performance as majority voting. Our dataset, model, and codebase are open-sourced.
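The block-wise search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate_block`, `value`, and `extract_answer` are hypothetical stand-ins for the reasoning model, the 1.5B token-level value model, and answer parsing, and the beam widths are illustrative defaults rather than the paper's settings.

```python
from collections import defaultdict

def vgs(generate_block, value, extract_answer,
        n_beams=4, n_expand=2, max_blocks=8):
    """Block-wise value-guided search sketch: expand each partial trace by
    candidate blocks, keep the n_beams highest-value prefixes, then pick the
    final answer by a value-weighted majority vote over surviving traces."""
    beams = [""]
    for _ in range(max_blocks):
        candidates = []
        for prefix in beams:
            for _ in range(n_expand):
                trace = prefix + generate_block(prefix)
                candidates.append((value(trace), trace))
        # Keep the highest-scoring partial traces according to the value model.
        candidates.sort(key=lambda vt: vt[0], reverse=True)
        beams = [trace for _, trace in candidates[:n_beams]]
    # Final weighted majority vote: each trace's answer is weighted by its value.
    votes = defaultdict(float)
    for trace in beams:
        votes[extract_answer(trace)] += value(trace)
    return max(votes, key=votes.get)

# Toy usage: a degenerate generator that always emits "a"; the value model
# scores a trace by how many "a" tokens it contains, and the "answer" is the
# trace's last character.
best = vgs(lambda p: "a", lambda t: t.count("a"), lambda t: t[-1])
# best == "a"
```

In the paper's setting the per-block value scores come from a single learned token-level model, which avoids the step-segmentation problem that PRMs face on long reasoning traces; the beam update above is where those scores guide the search.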

@article{wang2025_2505.17373,
  title={Value-Guided Search for Efficient Chain-of-Thought Reasoning},
  author={Kaiwen Wang and Jin Peng Zhou and Jonathan Chang and Zhaolin Gao and Nathan Kallus and Kianté Brantley and Wen Sun},
  journal={arXiv preprint arXiv:2505.17373},
  year={2025}
}