Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models. However, reward hacking, a phenomenon in which models exploit flaws in the reward model, remains a significant barrier to achieving robust and scalable intelligence through long-term training. Existing studies have proposed uncertain reward models to address reward hacking; however, they often lack systematic or theoretical foundations and fail to model the uncertainty that arises intrinsically from preference data, and thus cannot sufficiently mitigate reward hacking to sustain prolonged RLHF training and exploration. In this paper, we propose the Probabilistic Uncertain Reward Model (PURM), a natural generalization of the classical Bradley-Terry reward model that directly learns the reward distribution that emerges from the preference data. We theoretically derive PURM's loss function and the uncertainty of the reward distribution. To mitigate reward hacking with PURM, we further introduce an uncertainty-aware penalty into Proximal Policy Optimization (PPO), which leverages the learned uncertainty to dynamically balance reward optimization and exploration. Experimental results demonstrate that PURM significantly delays the onset of reward hacking while improving final performance compared with existing methods. We also find that PURM genuinely produces sound reward and uncertainty estimates. The data and code of this paper can be found at this https URL.
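To make the idea concrete, the sketch below contrasts the classical Bradley-Terry preference probability, sigmoid(r_chosen - r_rejected), with a distributional variant in which each reward is a random variable. It is an illustration under stated assumptions, not the loss derived in the paper: the independent Gaussian form of the reward distribution, the Monte Carlo estimator, the `mu - penalty_weight * sigma` penalty, and all function and parameter names are hypothetical choices made for this example.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Classical Bradley-Terry: P(chosen > rejected) from two point rewards."""
    return sigmoid(r_chosen - r_rejected)


def gaussian_preference_prob(mu_c: float, sigma_c: float,
                             mu_r: float, sigma_r: float,
                             n_samples: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[sigmoid(r_c - r_r)].

    ASSUMPTION (not taken from the paper): rewards are independent
    Gaussians, so their difference is N(mu_c - mu_r, sigma_c^2 + sigma_r^2).
    """
    rng = random.Random(seed)
    diff_mu = mu_c - mu_r
    diff_sigma = math.sqrt(sigma_c ** 2 + sigma_r ** 2)
    total = 0.0
    for _ in range(n_samples):
        total += sigmoid(rng.gauss(diff_mu, diff_sigma))
    return total / n_samples


def penalized_reward(mu: float, sigma: float, penalty_weight: float = 1.0) -> float:
    """Hypothetical uncertainty-aware reward: mean minus a scaled std penalty."""
    return mu - penalty_weight * sigma


if __name__ == "__main__":
    print(bt_preference_prob(2.0, 0.0))
    print(gaussian_preference_prob(2.0, 3.0, 0.0, 3.0))
    print(penalized_reward(1.0, 0.5, penalty_weight=2.0))
```

Under these assumptions, zero variance recovers the Bradley-Terry probability exactly, while larger variance pulls the expected preference probability toward 0.5. In that sense a distributional reward model generalizes the point-estimate one, and a penalty on the standard deviation discourages the policy from chasing rewards the model is unsure about.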
@article{sun2025_2503.22480,
  title={Probabilistic Uncertain Reward Model},
  author={Wangtao Sun and Xiang Cheng and Xing Yu and Haotian Xu and Zhao Yang and Shizhu He and Jun Zhao and Kang Liu},
  journal={arXiv preprint arXiv:2503.22480},
  year={2025}
}