arXiv:2307.15217
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
27 July 2023
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Ségerie, Micah Carroll, Andi Peng, Phillip J. K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, J. Pfau, Dmitrii Krasheninnikov, Xin Chen, L. Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Tags: ALM, OffRL
Papers citing "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback" (31 / 131 papers shown):
Consequences of Misaligned AI
Simon Zhuang, Dylan Hadfield-Menell
07 Feb 2021 (62 / 75 / 0)

Understanding Learned Reward Functions (XAI, OffRL)
Eric J. Michaud, Adam Gleave, Stuart J. Russell
10 Dec 2020 (64 / 34 / 0)

An overview of 11 proposals for building safe advanced AI (AAML)
Evan Hubinger
04 Dec 2020 (60 / 23 / 0)

Inverse Constrained Reinforcement Learning
Usman Anwar, Shehryar Malik, Alireza Aghasi, Ali Ahmed
19 Nov 2020 (61 / 59 / 0)

Learning to be Safe: Deep RL with a Safety Critic (OffRL)
K. Srinivasan, Benjamin Eysenbach, Sehoon Ha, Jie Tan, Chelsea Finn
27 Oct 2020 (80 / 144 / 0)

Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI
Alon Jacovi, Ana Marasović, Tim Miller, Yoav Goldberg
15 Oct 2020 (297 / 443 / 0)

Learning Rewards from Linguistic Feedback
T. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik Narasimhan, Thomas Griffiths
30 Sep 2020 (107 / 54 / 0)

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
24 Sep 2020 (152 / 1,199 / 0)

Hidden Incentives for Auto-Induced Distributional Shift
David M. Krueger, Tegan Maharaj, Jan Leike
19 Sep 2020 (67 / 51 / 0)

Assisted Perception: Optimizing Observations to Communicate State
S. Reddy, Sergey Levine, Anca Dragan
06 Aug 2020 (72 / 15 / 0)

Modeling and mitigating human annotation errors to design efficient stream processing systems with human-in-the-loop machine learning
Rahul Pandey, Hemant Purohit, Carlos Castillo, V. Shalin
07 Jul 2020 (20 / 36 / 0)

AI Research Considerations for Human Existential Safety (ARCHES)
Andrew Critch, David M. Krueger
30 May 2020 (90 / 53 / 0)

Active Preference-Based Gaussian Process Regression for Reward Learning (GP)
Erdem Biyik, Nicolas Huynh, Mykel J. Kochenderfer, Dorsa Sadigh
06 May 2020 (66 / 108 / 0)

Reward-rational (implicit) choice: A unifying formalism for reward learning
Hong Jun Jeon, S. Milli, Anca Dragan
12 Feb 2020 (71 / 177 / 0)

The Windfall Clause: Distributing the Benefits of AI for the Common Good
Cullen O'Keefe, P. Cihon, Ben Garfinkel, Carrick Flynn, Jade Leung, Allan Dafoe
25 Dec 2019 (18 / 39 / 0)

Optimal Policies Tend to Seek Power
Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
03 Dec 2019 (46 / 70 / 0)

Asking Easy Questions: A User-Friendly Approach to Active Reward Learning
Erdem Biyik, Malayandi Palan, Nicholas C. Landolfi, Dylan P. Losey, Dorsa Sadigh
10 Oct 2019 (41 / 116 / 0)

Fine-Tuning Language Models from Human Preferences (ALM)
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
18 Sep 2019 (466 / 1,734 / 0)

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference
Rohin Shah, Noah Gundotra, Pieter Abbeel, Anca Dragan
23 Jun 2019 (43 / 72 / 0)

A Survey of Reinforcement Learning Informed by Natural Language (LM&Ro, KELM, OffRL, LRM)
Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob N. Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, Tim Rocktäschel
10 Jun 2019 (78 / 282 / 0)

Adversarial Policies: Attacking Deep Reinforcement Learning (AAML)
Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart J. Russell
25 May 2019 (80 / 355 / 0)

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
Daniel S. Brown, Wonjoon Goo, P. Nagarajan, S. Niekum
12 Apr 2019 (71 / 357 / 0)

Using Natural Language for Reward Shaping in Reinforcement Learning (LM&Ro)
Prasoon Goyal, S. Niekum, Raymond J. Mooney
05 Mar 2019 (87 / 182 / 0)

Scalable agent alignment via reward modeling: a research direction
Jan Leike, David M. Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
19 Nov 2018 (93 / 413 / 0)

Adversarial Examples: Opportunities and Challenges (AAML)
Jiliang Zhang, Chen Li
13 Sep 2018 (55 / 234 / 0)

A Voting-Based System for Ethical Decision Making (FaML)
Ritesh Noothigattu, Snehalkumar (Neil) S. Gaikwad, E. Awad, Sohan Dsouza, Iyad Rahwan, Pradeep Ravikumar, Ariel D. Procaccia
20 Sep 2017 (41 / 199 / 0)

Deep Reinforcement Learning that Matters (OffRL)
Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
19 Sep 2017 (118 / 1,954 / 0)

Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback
Khanh Nguyen, Hal Daumé, Jordan L. Boyd-Graber
24 Jul 2017 (65 / 138 / 0)

Delving into adversarial attacks on deep policies (AAML)
Jernej Kos, D. Song
18 May 2017 (59 / 226 / 0)

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks (UQCV)
Dan Hendrycks, Kevin Gimpel
07 Oct 2016 (158 / 3,454 / 0)

BPR: Bayesian Personalized Ranking from Implicit Feedback (BDL)
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme
09 May 2012 (150 / 5,727 / 0)