arXiv:2410.06554 (v2, latest)

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
9 October 2024
Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen
Links: arXiv (abs) · PDF · HTML · GitHub (14★)
Main: 5 pages · Bibliography: 2 pages · Appendix: 3 pages · 27 figures · 7 tables
Abstract

Reinforcement Learning from Human Feedback (RLHF) significantly enhances natural language processing by aligning language models with human expectations. A critical factor in this alignment is the strength of the reward models used during training. This study explores whether stronger reward models invariably lead to better language models. Through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and it opens new avenues for research into the key factors driving model performance and into how to choose the most suitable reward models. Code and additional details are available at https://github.com/EIT-NLP/AccuracyParadox-RLHF.
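
The experimental design pairs a fixed policy-optimization recipe with reward models of varying accuracy and then scores the trained policy against the true objective. The following is a minimal, self-contained Python/NumPy sketch of that protocol, not the paper's actual pipeline (which fine-tunes language models on QA-FEEDBACK with Longformer-based reward models via PPO): a softmax policy over candidate responses is trained with REINFORCE against a reward model whose agreement rate with the true reward is controlled by a hypothetical rm_accuracy knob, and the names true_reward, reward_model, and train_policy are illustrative assumptions. In this toy setting higher accuracy generally helps; the paper's contribution is showing the relationship can invert in real language-model training.

# Toy sketch of the accuracy-vs-performance protocol (illustrative only):
# train a softmax policy with REINFORCE under reward models of varying
# accuracy, then evaluate the policy against the true reward.
import numpy as np

rng = np.random.default_rng(0)
K = 10                                   # number of candidate "responses"
true_reward = np.linspace(0.0, 1.0, K)   # response K-1 is truly best

def reward_model(action: int, rm_accuracy: float) -> float:
    """Noisy proxy: with probability rm_accuracy return the true reward,
    otherwise return the reward of a uniformly random response."""
    if rng.random() < rm_accuracy:
        return float(true_reward[action])
    return float(true_reward[rng.integers(K)])

def train_policy(rm_accuracy: float, steps: int = 5000, lr: float = 0.1) -> np.ndarray:
    logits = np.zeros(K)
    for _ in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(K, p=probs)
        r = reward_model(a, rm_accuracy)
        # REINFORCE: grad of log softmax-policy prob is (one_hot(a) - probs)
        grad = -probs
        grad[a] += 1.0
        logits += lr * r * grad          # gradient ascent on expected proxy reward
    return logits

# Sweep reward-model accuracy and report the policy's expected TRUE reward.
for acc in (0.5, 0.7, 0.9, 1.0):
    logits = train_policy(acc)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(f"RM accuracy {acc:.1f}: expected true reward {probs @ true_reward:.3f}")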
