Learning a Pessimistic Reward Model in RLHF

26 May 2025
Yinglun Xu
Hangoo Kang
Tarun Suresh
Yuxuan Wan
Gagandeep Singh
Abstract

This work proposes PET, a novel pessimistic reward fine-tuning method for learning a pessimistic reward model that is robust against reward hacking in offline reinforcement learning from human feedback (RLHF). Traditional reward modeling techniques in RLHF train an imperfect reward model, and KL regularization toward the dataset distribution plays a pivotal role in mitigating reward hacking when a policy is optimized against that model. This intuition-based remedy still suffers from reward hacking, and it excludes policies with large KL divergence from the dataset distribution during learning. In contrast, we show that when a policy is optimized against a pessimistic reward model fine-tuned through PET, reward hacking can be prevented without relying on any regularization. We test our method on the standard TL;DR summarization dataset and find that a high-quality policy can be learned on our pessimistic reward model without any regularization. Such a policy has a high KL divergence from the dataset distribution while performing well in practice. In summary, our work shows the feasibility of learning a pessimistic reward model that resists reward hacking: the agent can greedily search for a policy with high pessimistic reward without suffering from reward hacking.
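For readers less familiar with the setup, the contrast drawn in the abstract can be summarized at the objective level. The following is a sketch using standard RLHF notation (policy \pi_\theta, reference/dataset policy \pi_{\mathrm{ref}}, learned reward model \hat{r}, regularization strength \beta); the symbol \hat{r}_{\mathrm{PET}} for the pessimistic reward model is shorthand introduced here, and the details of how PET produces it are in the paper, not reproduced below.

% Standard KL-regularized RLHF: optimize the learned reward while staying close
% to the reference policy, with the KL term used to limit reward hacking.
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[
  \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\!\left[\hat{r}(x,y)\right]
  - \beta\, \mathrm{KL}\!\left(\pi_\theta(\cdot\mid x)\,\middle\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)
\right]

% Setting described in the abstract: greedy, unregularized optimization of a
% pessimistic reward model obtained by PET fine-tuning (no KL term, i.e. \beta = 0).
% \hat{r}_{\mathrm{PET}} is shorthand for that pessimistic reward model.
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\!\left[
  \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)}\!\left[\hat{r}_{\mathrm{PET}}(x,y)\right]
\right]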

@article{xu2025_2505.20556,
  title={Learning a Pessimistic Reward Model in RLHF},
  author={Yinglun Xu and Hangoo Kang and Tarun Suresh and Yuxuan Wan and Gagandeep Singh},
  journal={arXiv preprint arXiv:2505.20556},
  year={2025}
}