Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

Reinforcement learning has emerged as an effective framework for training large language models on structured language-conditioned tasks. We identify a critical flaw in Group Relative Policy Optimization (GRPO), a widely used RL algorithm in this setting. For tasks that require multi-sample performance, such as formal theorem proving, GRPO is biased toward reinforcing already probable solutions and neglects rare but correct proofs. This implicit bias impairs performance on pass@N metrics at large sample sizes, limiting its practicality for training theorem provers. To address this, we introduce the unlikeliness reward, a straightforward method that explicitly encourages reinforcing rare correct solutions. Additionally, we find that increasing the number of PPO epochs further mitigates this bias. Our experiments confirm that incorporating the unlikeliness reward significantly improves pass@N across a wide range of N, outperforming standard GRPO and substantially increasing sample diversity. Applying our revised recipe to Lean, we achieve performance competitive with DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation, providing a simple yet effective recipe for training formal theorem provers with RL.
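The abstract does not state the formula for the unlikeliness reward, so the following is only a minimal sketch of one way such a reward could work, under stated assumptions: each correct sample's reward is scaled down according to its likelihood rank within the sampled group, so that rarer correct samples end up with larger group-relative advantages. The function names, the rank-based penalty, and the `beta` value are all illustrative assumptions, not the authors' released implementation.

```python
from statistics import mean, pstdev


def unlikeliness_rewards(correct, logprobs, beta=0.25):
    """Hypothetical rank-based reward: penalize likely correct samples.

    correct  -- list of 0/1 flags, one per sampled solution in the group
    logprobs -- sequence log-probabilities under the current policy
    beta     -- assumed penalty strength in [0, 1)
    """
    G = len(correct)
    # Rank 0 is the most likely sample; rank G - 1 is the least likely.
    order = sorted(range(G), key=lambda i: logprobs[i], reverse=True)
    rank = {idx: r for r, idx in enumerate(order)}
    # Incorrect samples keep reward 0; correct ones lose up to beta.
    return [
        float(correct[i]) * (1.0 - beta * (G - rank[i]) / G)
        for i in range(G)
    ]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four samples: three correct, one incorrect.
correct = [1, 1, 0, 1]
logprobs = [-1.0, -5.0, -2.0, -9.0]

baseline_adv = group_advantages([float(c) for c in correct])
shaped_rewards = unlikeliness_rewards(correct, logprobs)
shaped_adv = group_advantages(shaped_rewards)
```

With plain 0/1 rewards, every correct sample in the toy group receives the same advantage regardless of how likely it was. Under the assumed rank-based scaling, the least likely correct sample receives the largest advantage, which is the qualitative behavior the abstract describes.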
@article{he2025_2506.02355,
  title={Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening},
  author={Andre He and Daniel Fried and Sean Welleck},
  journal={arXiv preprint arXiv:2506.02355},
  year={2025}
}