Outcome-based Reinforcement Learning to Predict the Future

23 May 2025
Benjamin Turtel
Danny Franklin
Kris Skotheim
Luke Hewitt
Philipp Schoenegger
Main: 12 pages, 4 figures, 1 table; bibliography: 2 pages.
Abstract

Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p < 0.001). A simple trading rule turns this calibration edge into $127 of hypothetical profit versus $92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.
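
As a rough illustration of the advantage adaptations the abstract describes, the sketch below (Python/NumPy) shows mean-centred group advantages without per-question variance scaling, alongside a ReMax-style baseline-subtracted advantage. It assumes a scalar reward per rollout (e.g. a negative Brier score against the resolved outcome); the function names and reward convention are ours, not taken from the paper.

import numpy as np

def grpo_advantages(group_rewards):
    # Group-relative advantages for one question's sampled rollouts.
    # Standard GRPO divides by the group's reward std; per the abstract,
    # that per-question variance scaling is removed here, leaving plain
    # mean-centring within the group.
    group_rewards = np.asarray(group_rewards, dtype=float)
    return group_rewards - group_rewards.mean()

def remax_advantage(sampled_reward, greedy_reward):
    # ReMax-style advantage: the reward of a sampled response minus the
    # reward of the greedy response to the same question, which acts as
    # a variance-reducing baseline.
    return sampled_reward - greedy_reward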

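The guard-rails are described only at a high level; a minimal sketch of what such reward-shaping penalties could look like follows, using simple character- and length-based heuristics. The checks, thresholds, and penalty values are illustrative placeholders, not the paper's.

def guardrail_penalty(response, min_rationale_words=20):
    penalty = 0.0
    # Gibberish / non-English heuristic: require that most characters are
    # ASCII letters, digits, whitespace, or common punctuation.
    ok = sum(c.isascii() and (c.isalnum() or c.isspace() or c in ".,;:%()-'")
             for c in response)
    if ok / max(len(response), 1) < 0.6:
        penalty += 1.0
    # Missing-rationale heuristic: require a minimum amount of prose
    # accompanying the final forecast.
    if len(response.split()) < min_rationale_words:
        penalty += 1.0
    return penalty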
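
The two evaluation metrics cited in the abstract, Brier score and expected calibration error (ECE), have standard definitions, sketched below. The equal-width 10-bin ECE is a common convention and an assumption here; the paper's binning is not stated in the abstract.

import numpy as np

def brier_score(probs, outcomes):
    # Mean squared error between forecast probabilities and 0/1 outcomes.
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    # Frequency-weighted mean gap, per probability bin, between the
    # average forecast and the empirical outcome rate.
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(ece)
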
@article{turtel2025_2505.17989,
  title={Outcome-based Reinforcement Learning to Predict the Future},
  author={Benjamin Turtel and Danny Franklin and Kris Skotheim and Luke Hewitt and Philipp Schoenegger},
  journal={arXiv preprint arXiv:2505.17989},
  year={2025}
}