2403.03185
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
5 March 2024
Cassidy Laidlaw
Shivam Singhal
Anca Dragan
AAML
Papers citing
"Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking"
5 / 5 papers shown
Learning to Assist Humans without Inferring Rewards
Vivek Myers
Evan Ellis
Sergey Levine
Benjamin Eysenbach
Anca Dragan
43
2
0
17 Jan 2025
Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese
Nat McAleese
Maja Trębacz
John Aslanides
Vlad Firoiu
...
John F. J. Mellor
Demis Hassabis
Koray Kavukcuoglu
Lisa Anne Hendricks
G. Irving
ALM
AAML
227
502
0
28 Sep 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
319
11,953
0
04 Mar 2022
Reward (Mis)design for Autonomous Driving
W. B. Knox
A. Allievi
Holger Banzhaf
Felix Schmitt
Peter Stone
83
113
0
28 Apr 2021
Reinforcement Learning for Optimization of COVID-19 Mitigation policies
Varun Kompella
Roberto Capobianco
Stacy Jong
Jonathan Browne
S. Fox
L. Meyers
Peter R. Wurman
Peter Stone
72
47
0
20 Oct 2020