Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

17 June 2025
Xumeng Wen
Zihan Liu
Shun Zheng
Zhijian Xu
Shengyu Ye
Zhirong Wu
Xiao Liang
Yang Wang
Junjie Li
Ziming Miao
Jiang Bian
Mao Yang
Main: 10 pages · 6 figures · Bibliography: 5 pages · Appendix: 14 pages
Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the Pass@K metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that can arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, CoT-Pass@K, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using CoT-Pass@K, we observe that RLVR can incentivize the generalization of correct reasoning for all values of K. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
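To make the gap between the two metrics concrete, here is a minimal Python sketch (not from the paper). It assumes the standard unbiased Pass@K estimator of Chen et al. (2021) and that CoT correctness is judged separately per sample; the counts below are hypothetical, and the abstract does not specify the paper's exact CoT-judging protocol.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased Pass@K estimator: the probability that at least one
    # of k samples drawn without replacement from n generations is
    # correct, given that c of the n generations are correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cot_pass_at_k(n: int, c_cot: int, k: int) -> float:
    # CoT-Pass@K: the same estimator, but a generation counts as
    # correct only if BOTH its chain of thought and its final
    # answer are correct, so c_cot <= c by construction.
    return pass_at_k(n, c_cot, k)

# Hypothetical numbers: 16 samples, 10 with the right final answer,
# of which only 6 also have a fully correct chain of thought.
print(pass_at_k(16, 10, 4))      # ~0.992, credits lucky guesses
print(cot_pass_at_k(16, 6, 4))   # ~0.885, requires sound reasoning

The difference between the two printed values is exactly the credit that Pass@K awards to answers reached through flawed reasoning, which is the inflation the abstract argues against.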

@article{wen2025_2506.14245,
  title={Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs},
  author={Xumeng Wen and Zihan Liu and Shun Zheng and Zhijian Xu and Shengyu Ye and Zhirong Wu and Xiao Liang and Yang Wang and Junjie Li and Ziming Miao and Jiang Bian and Mao Yang},
  journal={arXiv preprint arXiv:2506.14245},
  year={2025}
}