Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

25 March 2025
Seungone Kim
Ian Wu
Jinu Lee
Xiang Yue
Seongyun Lee
Mingyeong Moon
Kiril Gashteovski
Carolin (Haas) Lawrence
J. Hockenmaier
Graham Neubig
Sean Welleck
Abstract

As language model (LM) outputs become increasingly natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique for solving challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models, i.e., LMs that natively generate long chain-of-thought reasoning, as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically as more reasoning tokens are generated, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
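
The abstract describes two levers for evaluation-time compute: prompting a reasoning model to judge each step of a candidate solution (process evaluation) and using the resulting scores to rerank multiple generations. The sketch below illustrates that pipeline under stated assumptions; the call_reasoning_model and parse_score helpers, the prompt wording, the newline-based step splitting, and the min-aggregation of step scores are illustrative choices, not the paper's released implementation.

import re
from typing import List

def call_reasoning_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a long chain-of-thought reasoning model
    and return its final answer text (wire this to the API client of your choice)."""
    raise NotImplementedError

def parse_score(text: str) -> float:
    """Extract the first number from the evaluator's reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", text)
    return max(0.0, min(1.0, float(match.group()))) if match else 0.0

def process_evaluate(problem: str, response: str) -> float:
    """Process evaluation: score each step of `response` separately, then aggregate.
    Steps are assumed to be newline-separated; taking the minimum penalizes any faulty step."""
    steps = [s for s in response.split("\n") if s.strip()]
    step_scores = []
    for i in range(1, len(steps) + 1):
        prompt = (
            f"Problem:\n{problem}\n\n"
            "Solution so far:\n" + "\n".join(steps[:i]) + "\n\n"
            f"Is step {i} correct? Answer with a score between 0 (wrong) and 1 (correct)."
        )
        step_scores.append(parse_score(call_reasoning_model(prompt)))
    return min(step_scores) if step_scores else 0.0

def rerank(problem: str, candidates: List[str]) -> str:
    """Best-of-N selection: keep the candidate with the highest process score."""
    return max(candidates, key=lambda c: process_evaluate(problem, c))

Spending more evaluator compute here means generating longer reasoning per step judgment and/or scoring more candidates, which is the trade-off the paper compares against spending the same compute on additional generations.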

@article{kim2025_2503.19877,
  title={Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators},
  author={Seungone Kim and Ian Wu and Jinu Lee and Xiang Yue and Seongyun Lee and Mingyeong Moon and Kiril Gashteovski and Carolin Lawrence and Julia Hockenmaier and Graham Neubig and Sean Welleck},
  journal={arXiv preprint arXiv:2503.19877},
  year={2025}
}