Strategyproof Reinforcement Learning from Human Feedback

13 March 2025
Thomas Kleine Buening
Jiarui Gan
Debmalya Mandal
Marta Z. Kwiatkowska
Abstract

We study Reinforcement Learning from Human Feedback (RLHF), where multiple individuals with diverse preferences provide feedback strategically to sway the final policy in their favor. We show that existing RLHF methods are not strategyproof, which can result in learning a substantially misaligned policy even when only one out of k individuals reports their preferences strategically. In turn, we also find that any strategyproof RLHF algorithm must perform k-times worse than the optimal policy, highlighting an inherent trade-off between incentive alignment and policy alignment. We then propose a pessimistic median algorithm that, under appropriate coverage assumptions, is approximately strategyproof and converges to the optimal policy as the number of individuals and samples increases.
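
The abstract names the pessimistic median algorithm but gives no construction details, so the sketch below is only a rough illustration of the general idea it hints at: per-individual value estimates made pessimistic (here via a lower confidence bound) and then aggregated with a median across individuals, which limits how much any single strategic reporter can shift the selected policy. The toy data model, the lower-confidence-bound form of pessimism, and the finite candidate-policy set are assumptions for illustration, not the authors' construction.

# Illustrative sketch only; all modeling choices below are assumptions,
# not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

k = 7            # number of individuals providing feedback
n_policies = 5   # candidate policies to choose from
n_samples = 50   # feedback samples per individual

# Toy ground truth: each individual's true value for each policy.
true_values = rng.uniform(0.0, 1.0, size=(k, n_policies))

# Noisy per-individual value estimates, e.g. from a reward model fit to
# that individual's (possibly strategically reported) preferences.
noise = rng.normal(0.0, 0.2, size=(k, n_policies, n_samples))
estimates = true_values[:, :, None] + noise

# Pessimism: use a lower confidence bound per individual and policy, so
# policies that only look good under sparse or uncertain coverage are discounted.
mean = estimates.mean(axis=2)
std_err = estimates.std(axis=2, ddof=1) / np.sqrt(n_samples)
pessimistic = mean - 2.0 * std_err

# Median aggregation across individuals: a single strategic reporter can
# distort their own row, but moves the per-policy median very little.
median_values = np.median(pessimistic, axis=0)
chosen = int(np.argmax(median_values))
print("selected policy index:", chosen)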

View on arXiv
@article{buening2025_2503.09561,
  title={Strategyproof Reinforcement Learning from Human Feedback},
  author={Thomas Kleine Buening and Jiarui Gan and Debmalya Mandal and Marta Kwiatkowska},
  journal={arXiv preprint arXiv:2503.09561},
  year={2025}
}