The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

10 January 2022
Alexander Pan, Kush S. Bhatia, Jacob Steinhardt

Papers citing "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models"

50 / 130 papers shown
HackAtari: Atari Learning Environments for Robust and Continual Reinforcement Learning
Quentin Delfosse, Jannis Blüml, Bjarne Gregori, Kristian Kersting
06 Jun 2024

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit S. Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, S. Niekum
05 Jun 2024

AI Risk Management Should Incorporate Both Safety and Security
Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, ..., Chaowei Xiao, Bo-wen Li, Dawn Song, Peter Henderson, Prateek Mittal
AAML
29 May 2024

Offline Regularised Reinforcement Learning for Large Language Models Alignment
Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, M. G. Azar, ..., Gil Shamir, Rishabh Joshi, Tianqi Liu, Rémi Munos, Bilal Piot
OffRL
29 May 2024

Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee, Minsu Kim, Lynn Cherif, David Dobre, Juho Lee, ..., Kenji Kawaguchi, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, Moksh Jain
AAML
28 May 2024

Phase Transitions in the Output Distribution of Large Language Models
Julian Arnold, Flemming Holtorf, Frank Schafer, Niels Lörch
27 May 2024

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum
10 May 2024

Best Practices and Lessons Learned on Synthetic Data for Language Models
Ruibo Liu, Jerry W. Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, ..., Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai
SyDa, EgoV
11 Apr 2024

Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data
Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler
AAML
08 Apr 2024

Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric Factors
Chen Huang, Peixin Qin, Yang Deng, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua
04 Apr 2024

Regularized Best-of-N Sampling with Minimum Bayes Risk Objective for Language Model Alignment
Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe
01 Apr 2024

Disentangling Length from Quality in Direct Preference Optimization
Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn
ALM
28 Mar 2024

LORD: Large Models based Opposite Reward Design for Autonomous Driving
Xin Ye, Feng Tao, Abhirup Mallik, Burhaneddin Yaman, Liu Ren
OffRL
27 Mar 2024

Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im, Yixuan Li
ALM
27 Mar 2024

Safe and Robust Reinforcement Learning: Principles and Practice
Taku Yamagata, Raúl Santos-Rodríguez
OffRL
27 Mar 2024

Scaling Learning based Policy Optimization for Temporal Tasks via Dropout
Navid Hashemi, Bardh Hoxha, Danil Prokhorov, Georgios Fainekos, Jyotirmoy Deshmukh
23 Mar 2024

Human Alignment of Large Language Models through Online Preference Optimisation
Daniele Calandriello, Daniel Guo, Rémi Munos, Mark Rowland, Yunhao Tang, ..., Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
13 Mar 2024

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation
Xiaoying Zhang, Jean-François Ton, Wei Shen, Hongning Wang, Yang Liu
08 Mar 2024

On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
Xinpeng Wang, Shitong Duan, Xiaoyuan Yi, Jing Yao, Shanlin Zhou, Zhihua Wei, Peng Zhang, Dongkuan Xu, Maosong Sun, Xing Xie
OffRL
07 Mar 2024

Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw, Shivam Singhal, Anca Dragan
AAML
05 Mar 2024

DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation
Xueqing Wu, Rui Zheng, Jingzhen Sha, Te-Lin Wu, Hanyu Zhou, Mohan Tang, Kai-Wei Chang, Nanyun Peng, Haoran Huang
04 Mar 2024

Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
S. Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip H. S. Torr, Lewis Hammond, Christian Schroeder de Witt
12 Feb 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
KELM
09 Feb 2024

Explaining Learned Reward Functions with Counterfactual Trajectories
Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert
07 Feb 2024

Decoding-time Realignment of Language Models
Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares-López, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel
AI4CE
05 Feb 2024

Rethinking the Role of Proxy Rewards in Language Model Alignment
Sungdong Kim, Minjoon Seo
SyDa, ALM
02 Feb 2024

Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering
Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
LLMSV
29 Jan 2024

WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
22 Jan 2024

Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization
Houda Nait El Barj, Théophile Sautory
14 Jan 2024

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, ..., Katherine Heller, Stephen R. Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
14 Dec 2023

Omega-Regular Decision Processes
E. M. Hahn, Mateo Perez, S. Schewe, F. Somenzi, Ashutosh Trivedi, D. Wojtczak
14 Dec 2023

FoMo Rewards: Can we cast foundation models as reward functions?
Ekdeep Singh Lubana, Johann Brehmer, P. D. Haan, Taco S. Cohen
OffRL, LRM
06 Dec 2023

Risk-averse Batch Active Inverse Reward Design
Panagiotis Liampas
20 Nov 2023

Value FULCRA: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Values
Jing Yao, Xiaoyuan Yi, Xiting Wang, Yifan Gong, Xing Xie
15 Nov 2023

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback
Nathan Lambert, Roberto Calandra
ALM
31 Oct 2023

A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking
Rose Hadshar
27 Oct 2023

Social Contract AI: Aligning AI Assistants with Implicit Group Norms
Jan-Philipp Fränken, Sam Kwok, Peixuan Ye, Kanishk Gandhi, Dilip Arumugam, Jared Moore, Alex Tamkin, Tobias Gerstenberg, Noah D. Goodman
26 Oct 2023

Managing extreme AI risks amid rapid progress
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, ..., Philip H. S. Torr, Stuart J. Russell, Daniel Kahneman, J. Brauner, Sören Mindermann
26 Oct 2023

Active teacher selection for reinforcement learning from human feedback
Rachel Freedman, Justin Svegliato, K. H. Wray, Stuart J. Russell
23 Oct 2023

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, ..., Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang
18 Oct 2023

Goodhart's Law in Reinforcement Learning
Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse
13 Oct 2023

SALMON: Self-Alignment with Instructable Reward Models
Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David D. Cox, Yiming Yang, Chuang Gan
ALM, SyDa
09 Oct 2023

Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback
Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
ALM
08 Oct 2023

Reward Model Ensembles Help Mitigate Overoptimization
Thomas Coste, Usman Anwar, Robert Kirk, David M. Krueger
NoLa, ALM
04 Oct 2023

STARC: A General Framework For Quantifying Differences Between Reward Functions
Joar Skalse, Lucy Farnik, S. Motwani, Erik Jenner, Adam Gleave, Alessandro Abate
26 Sep 2023

Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF
Simeng Sun, Dhawal Gupta, Mohit Iyyer
16 Sep 2023

Iterative Reward Shaping using Human Feedback for Correcting Reward Misspecification
Jasmina Gajcin, J. McCarthy, Rahul Nair, Radu Marinescu, Elizabeth M. Daly, Ivana Dusparic
30 Aug 2023

VisAlign: Dataset for Measuring the Degree of Alignment between AI and Humans in Visual Perception
Jiyoung Lee, Seung Wook Kim, Seunghyun Won, Joonseok Lee, Marzyeh Ghassemi, James Thorne, Jaeseok Choi, O.-Kil Kwon, E. Choi
03 Aug 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
ALM, OffRL
27 Jul 2023

Let Me Teach You: Pedagogical Foundations of Feedback for Language Models
Beatriz Borges, Niket Tandon, Tanja Kaser, Antoine Bosselut
01 Jul 2023