Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

5 March 2024

Papers citing "Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking"


Title
Learning to Assist Humans without Inferring Rewards; Vivek Myers, Evan Ellis, Sergey Levine, Benjamin Eysenbach, Anca Dragan; 43 2 0; 17 Jan 2025
Improving alignment of dialogue agents via targeted human judgements; Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving; ALM AAML; 227 502 0; 28 Sep 2022
Training language models to follow instructions with human feedback; Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe; OSLM ALM; 328 11,953 0; 04 Mar 2022
Reward (Mis)design for Autonomous Driving; W. B. Knox, A. Allievi, Holger Banzhaf, Felix Schmitt, Peter Stone; 83 113 0; 28 Apr 2021
Reinforcement Learning for Optimization of COVID-19 Mitigation policies; Varun Kompella, Roberto Capobianco, Stacy Jong, Jonathan Browne, S. Fox, L. Meyers, Peter R. Wurman, Peter Stone; 75 47 0; 20 Oct 2020