2407.14503
Cited By
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
19 July 2024
Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
Papers citing
"Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification"
1 / 1 papers shown
Title
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
ALM
280
1,595
0
18 Sep 2019