Goal Misgeneralization in Deep Reinforcement Learning

28 May 2021
L. Langosco, Jack Koch, Lee D. Sharkey, J. Pfau, Laurent Orseau, David M. Krueger

Papers citing "Goal Misgeneralization in Deep Reinforcement Learning"

50 / 60 papers shown
  • What Is AI Safety? What Do We Want It to Be?
    Jacqueline Harding, Cameron Domenico Kirk-Giannini (05 May 2025)

  • Evaluating the Goal-Directedness of Large Language Models [ELM, LM&MA, LM&Ro, LRM]
    Tom Everitt, Cristina Garbacea, Alexis Bellot, Jonathan G. Richens, Henry Papadatos, Simeon Campos, Rohin Shah (16 Apr 2025)

  • Better Decisions through the Right Causal World Model
    Elisabeth Dillies, Quentin Delfosse, Jannis Blüml, Raban Emunds, Florian Peter Busch, Kristian Kersting (09 Apr 2025)

  • Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning [LRM]
    Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu (06 Apr 2025)

  • Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models [LRM]
    Teng Wang, Zhangyi Jiang, Zhenqi He, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Shenyang Tong, Hailei Gong (16 Mar 2025)

  • Measuring Goal-Directedness
    Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt (06 Dec 2024)

  • Noisy Zero-Shot Coordination: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games
    Usman Anwar, Ashish Pandian, Jia Wan, David M. Krueger, Jakob N. Foerster (07 Nov 2024)

  • Mechanistic Interpretability of Reinforcement Learning Agents [AI4CE]
    Tristan Trim, Triston Grayston (30 Oct 2024)

  • Predicting Future Actions of Reinforcement Learning Agents
    Stephen Chung, Scott Niekum, David M. Krueger (29 Oct 2024)

  • Interpretable end-to-end Neurosymbolic Reinforcement Learning agents [OffRL]
    Nils Grandien, Quentin Delfosse, Kristian Kersting (18 Oct 2024)

  • Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
    Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David M. Krueger, Fazl Barez (09 Oct 2024)

  • OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
    Yu-Shin Huang, Peter Just, Krishna Narayanan, Chao Tian (06 Oct 2024)

  • Beyond Preferences in AI Alignment
    Tan Zhi-Xuan, Micah Carroll, Matija Franklin, Hal Ashton (30 Aug 2024)

  • On the Undecidability of Artificial Intelligence Alignment: Machines that Halt
    Gabriel Adriano de Melo, Marcos Ricardo Omena de Albuquerque Máximo, Nei Yoshihiro Soma, Paulo Andre Lima de Castro (16 Aug 2024)

  • Exploring and Addressing Reward Confusion in Offline Preference Learning [OffRL]
    Xin Chen, Sam Toyer, Florian Shkurti (22 Jul 2024)

  • Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
    Karolis Jucys, George Adamopoulos, Mehrab Hamidi, Stephanie Milani, Mohammad Reza Samsami, Artem Zholus, Sonia Joseph, Blake A. Richards, Irina Rish, Özgür Simsek (16 Jul 2024)

  • AI Safety in Generative AI Large Language Models: A Survey [LM&MA]
    Jaymari Chua, Yun Yvonna Li, Shiyi Yang, Chen Wang, Lina Yao (06 Jul 2024)

  • Towards shutdownable agents via stochastic choice
    Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson (30 Jun 2024)

  • Aligning Model Properties via Conformal Risk Control
    William Overman, Jacqueline Jil Vallon, Mohsen Bayati (26 Jun 2024)

  • Open-Endedness is Essential for Artificial Superhuman Intelligence [LRM]
    Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal M. P. Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, Tim Rocktaschel (06 Jun 2024)

  • HackAtari: Atari Learning Environments for Robust and Continual Reinforcement Learning
    Quentin Delfosse, Jannis Blüml, Bjarne Gregori, Kristian Kersting (06 Jun 2024)

  • Towards a Research Community in Interpretable Reinforcement Learning: the InterpPol Workshop
    Hector Kohler, Quentin Delfosse, Paul Festor, Philippe Preux (16 Apr 2024)

  • Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods [LLMAG, KELM, OffRL, LM&Ro]
    Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Guolong Liu, Gaoqi Liang, Junhua Zhao, Yun Li (30 Mar 2024)

  • Incentive Compatibility for AI Alignment in Sociotechnical Systems: Positions and Prospects
    Zhaowei Zhang, Fengshuo Bai, Mingzhi Wang, Haoyang Ye, Chengdong Ma, Yaodong Yang (20 Feb 2024)

  • Robust agents learn causal world models [OOD]
    Jonathan G. Richens, Tom Everitt (16 Feb 2024)

  • Reward Generalization in RLHF: A Topological Perspective [AI4CE]
    Tianyi Qiu, Fanzhi Zeng, Jiaming Ji, Dong Yan, Kaile Wang, Jiayi Zhou, Yang Han, Josef Dai, Xuehai Pan, Yaodong Yang (15 Feb 2024)

  • Agents Need Not Know Their Purpose
    Paulo Garcia (15 Feb 2024)

  • Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents
    Quentin Delfosse, Sebastian Sztwiertnia, M. Rothermel, Wolfgang Stammer, Kristian Kersting (11 Jan 2024)

  • Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning? [OffRL]
    Gunshi Gupta, Tim G. J. Rudner, R. McAllister, Adrien Gaidon, Y. Gal (28 Dec 2023)

  • Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q Learning
    Xizhou Bu, Wenjuan Li, Zhengxiong Liu, Zhiqiang Ma, Panfeng Huang (18 Dec 2023)

  • Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study
    Karolis Ramanauskas, Özgür Simsek (05 Dec 2023)

  • Risk-averse Batch Active Inverse Reward Design
    Panagiotis Liampas (20 Nov 2023)

  • Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
    Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, ..., Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang (18 Oct 2023)

  • Understanding and Controlling a Maze-Solving Policy Network
    Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, M. MacDiarmid, Alexander Matt Turner (12 Oct 2023)

  • Dynamic value alignment through preference aggregation of multiple objectives
    Marcin Korecki, Damian Dailisan, Cesare Carissimo (09 Oct 2023)

  • How the level sampling process impacts zero-shot generalisation in deep reinforcement learning
    Samuel Garcin, James Doran, Shangmin Guo, Christopher G. Lucas, Stefano V. Albrecht (05 Oct 2023)

  • CoinRun: Solving Goal Misgeneralisation [LRM]
    Stuart Armstrong, Alexandre Maranhao, Oliver Daniels-Koch, Ioannis Gkioulekas, Rebecca Gormann (28 Sep 2023)

  • Guide Your Agent with Adaptive Multimodal Rewards
    Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee, Jinwoo Shin, Honglak Lee, Kimin Lee (19 Sep 2023)

  • AI Deception: A Survey of Examples, Risks, and Potential Solutions
    Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks (28 Aug 2023)

  • Language Reward Modulation for Pretraining Reinforcement Learning
    Ademi Adeniji, Amber Xie, Carmelo Sferrazza, Younggyo Seo, Stephen James, Pieter Abbeel (23 Aug 2023)

  • Stabilizing Unsupervised Environment Design with a Learned Adversary
    Ishita Mediratta, Minqi Jiang, Jack Parker-Holder, Michael Dennis, Eugene Vinitsky, Tim Rocktaschel (21 Aug 2023)

  • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback [ALM, OffRL]
    Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell (27 Jul 2023)

  • Apolitical Intelligence? Auditing Delphi's responses on controversial political issues in the US
    J. H. Rystrøm (22 Jun 2023)

  • Concept Extrapolation: A Conceptual Primer
    Matija Franklin, Rebecca Gorman, Hal Ashton, Stuart Armstrong (19 Jun 2023)

  • OCAtari: Object-Centric Atari 2600 Reinforcement Learning Environments
    Quentin Delfosse, Jannis Blüml, Bjarne Gregori, Sebastian Sztwiertnia, Kristian Kersting (14 Jun 2023)

  • Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards [MoMe]
    Alexandre Ramé, Guillaume Couairon, Mustafa Shukor, Corentin Dancette, Jean-Baptiste Gaya, Laure Soulier, Matthieu Cord (07 Jun 2023)

  • Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL [CML]
    Miguel Suau, M. Spaan, F. Oliehoek (04 Jun 2023)

  • STEVE-1: A Generative Model for Text-to-Behavior in Minecraft [LM&Ro]
    Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, Sheila A. McIlraith (01 Jun 2023)

  • Consistency Regularization for Domain Generalization with Logit Attribution Matching
    Han Gao, Kaican Li, Weiyan Xie, Zhi Lin, Yongxiang Huang, Luning Wang, Caleb Chen Cao, N. Zhang (13 May 2023)

  • Approximate Shielding of Atari Agents for Safe Exploration
    Alexander W. Goodall, Francesco Belardinelli (21 Apr 2023)