The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Alexander Pan, Kush S. Bhatia, Jacob Steinhardt
arXiv:2201.03544, 10 January 2022
Papers citing "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models" (30 of 130 papers shown):

- An Overview of Catastrophic AI Risks (21 Jun 2023). Dan Hendrycks, Mantas Mazeika, Thomas Woodside.
- Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards (07 Jun 2023). Alexandre Ramé, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, Matthieu Cord.
- Survival Instinct in Offline Reinforcement Learning (05 Jun 2023). Anqi Li, Dipendra Kumar Misra, Andrey Kolobov, Ching-An Cheng.
- PAGAR: Taming Reward Misalignment in Inverse Reinforcement Learning-Based Imitation Learning with Protagonist Antagonist Guided Adversarial Reward (02 Jun 2023). Weichao Zhou, Wenchao Li.
- Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback (31 May 2023). Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, ..., Olivier Bachem, G. Elidan, Avinatan Hassidim, Olivier Pietquin, Idan Szpektor.
- Training Socially Aligned Language Models on Simulated Social Interactions (26 May 2023). Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi.
- Fundamental Limitations of Alignment in Large Language Models (19 Apr 2023). Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, Amnon Shashua.
- Positive AI: Key Challenges in Designing Artificial Intelligence for Wellbeing (12 Apr 2023). Willem van der Maden, Derek Lomas, Malak Sadek, P. Hekkert.
- Natural Selection Favors AIs over Humans (28 Mar 2023). Dan Hendrycks.
- Reward Design with Language Models (27 Feb 2023). Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh.
- Progress measures for grokking via mechanistic interpretability (12 Jan 2023). Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt.
- On The Fragility of Learned Reward Functions (09 Jan 2023). Lev McKinney, Yawen Duan, David M. Krueger, Adam Gleave.
- Few-Shot Preference Learning for Human-in-the-Loop RL (06 Dec 2022). Joey Hejna, Dorsa Sadigh.
- Misspecification in Inverse Reinforcement Learning (06 Dec 2022). Joar Skalse, Alessandro Abate.
- Reward Gaming in Conditional Text Generation (16 Nov 2022). Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, He He.
- Policy Optimization with Advantage Regularization for Long-Term Fairness in Decision Systems (22 Oct 2022). Eric Yang Yu, Zhizhen Qin, Min Kyung Lee, Sicun Gao.
- Redefining Counterfactual Explanations for Reinforcement Learning: Overview, Challenges and Opportunities (21 Oct 2022). Jasmina Gajcin, Ivana Dusparic.
- Scaling Laws for Reward Model Overoptimization (19 Oct 2022). Leo Gao, John Schulman, Jacob Hilton.
- Reward Learning with Trees: Methods and Evaluation (03 Oct 2022). Tom Bewley, J. Lawry, Arthur G. Richards, R. Craddock, Ian Henderson.
- Defining and Characterizing Reward Hacking (27 Sep 2022). Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David M. Krueger.
- In-context Learning and Induction Heads (24 Sep 2022). Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah.
- The Alignment Problem from a Deep Learning Perspective (30 Aug 2022). Richard Ngo, Lawrence Chan, Sören Mindermann.
- RL with KL penalties is better viewed as Bayesian inference (23 May 2022). Tomasz Korbak, Ethan Perez, Christopher L. Buckley.
- Causal Confusion and Reward Misidentification in Preference-Based Reward Learning (13 Apr 2022). J. Tien, Jerry Zhi-Yang He, Zackory M. Erickson, Anca Dragan, Daniel S. Brown.
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (12 Apr 2022). Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, ..., Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan.
- Unsolved Problems in ML Safety (28 Sep 2021). Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt.
- Goal Misgeneralization in Deep Reinforcement Learning (28 May 2021). L. Langosco, Jack Koch, Lee D. Sharkey, J. Pfau, Laurent Orseau, David M. Krueger.
- Reward (Mis)design for Autonomous Driving (28 Apr 2021). W. B. Knox, A. Allievi, Holger Banzhaf, Felix Schmitt, Peter Stone.
- Reinforcement Learning for Optimization of COVID-19 Mitigation Policies (20 Oct 2020). Varun Kompella, Roberto Capobianco, Stacy Jong, Jonathan Browne, S. Fox, L. Meyers, Peter R. Wurman, Peter Stone.
- Scaling Laws for Neural Language Models (23 Jan 2020). Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei.