Spurious Rewards: Rethinking Training Signals in RLVR

12 June 2025
Rulin Shao
Shuyue Stella Li
Rui Xin
Scott Geng
Yiping Wang
Sewoong Oh
Simon Shaolei Du
Nathan Lambert
Sewon Min
Ranjay Krishna
Yulia Tsvetkov
Hannaneh Hajishirzi
Pang Wei Koh
Luke Zettlemoyer
OffRL · ReLM · LRM
arXiv (abs) · PDF · HTML
Main: 17 pages · Bibliography: 1 page · Appendix: 25 pages · 34 figures · 5 tables
Abstract

We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves Qwen2.5-Math-7B's MATH-500 performance by 21.4 absolute percentage points with a random reward, 13.8 with a format reward, 24.1 with incorrect labels, 26.0 with 1-shot RL, and 27.1 with majority voting -- nearly matching the 29.1 points gained with ground-truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families such as Llama3 or OLMo2. In particular, we find code reasoning -- thinking in code without actual code execution -- to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, increasing from 65% to over 90%, even with spurious rewards. Overall, we hypothesize that, given the lack of a useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research be validated on diverse models rather than a single de facto choice, as we show that it is easy to obtain significant performance gains on Qwen models even with completely spurious reward signals.
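To make the reward variants named in the abstract concrete, the sketch below (Python) shows what each training signal might look like as a scalar reward function plugged into an RLVR loop. This is an illustrative assumption, not the authors' code: the function names, the extract_boxed_answer helper, and the uniform (completion, label) interface are hypothetical, and a real setup would feed these rewards into a policy-gradient trainer such as GRPO or PPO.

import random
import re
from collections import Counter


def extract_boxed_answer(completion: str) -> str | None:
    """Hypothetical helper: pull the last \\boxed{...} answer out of a completion."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1] if matches else None


def ground_truth_reward(completion: str, gold_answer: str) -> float:
    """Standard RLVR signal: 1.0 iff the extracted answer matches the gold label."""
    return float(extract_boxed_answer(completion) == gold_answer)


def random_reward(completion: str, gold_answer: str, p: float = 0.5) -> float:
    """Spurious signal: ignore the completion entirely and flip a biased coin."""
    return float(random.random() < p)


def format_reward(completion: str, gold_answer: str) -> float:
    """Spurious signal: reward only the presence of a \\boxed{...} answer.

    The gold_answer argument is unused; it is kept for a uniform interface."""
    return float(extract_boxed_answer(completion) is not None)


def incorrect_label_reward(completion: str, wrong_answer: str) -> float:
    """Spurious signal: reward agreement with a deliberately incorrect label."""
    return float(extract_boxed_answer(completion) == wrong_answer)


def majority_vote_reward(completion: str, all_completions: list[str]) -> float:
    """Weak self-supervised signal: reward agreement with the majority answer
    over a batch of sampled rollouts for the same prompt."""
    answers = [a for a in (extract_boxed_answer(c) for c in all_completions) if a is not None]
    if not answers:
        return 0.0
    majority, _ = Counter(answers).most_common(1)[0]
    return float(extract_boxed_answer(completion) == majority)


if __name__ == "__main__":
    rollouts = [r"The answer is \boxed{42}.", r"\boxed{42}", r"\boxed{7}"]
    print(majority_vote_reward(rollouts[0], rollouts))        # 1.0: agrees with majority
    print(format_reward("no boxed answer here", "42"))         # 0.0: no \boxed{} present

The point of the paper's comparison is that, for Qwen2.5-Math models, even the signals above that carry no information about correctness (random_reward, format_reward, incorrect_label_reward) yield large MATH-500 gains, while the same signals generally do not help Llama3 or OLMo2.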

@article{shao2025_2506.10947,
  title={Spurious Rewards: Rethinking Training Signals in RLVR},
  author={Rulin Shao and Shuyue Stella Li and Rui Xin and Scott Geng and Yiping Wang and Sewoong Oh and Simon Shaolei Du and Nathan Lambert and Sewon Min and Ranjay Krishna and Yulia Tsvetkov and Hannaneh Hajishirzi and Pang Wei Koh and Luke Zettlemoyer},
  journal={arXiv preprint arXiv:2506.10947},
  year={2025}
}