Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2210.01790
Cited By
Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
4 October 2022
Rohin Shah
Vikrant Varma
Ramana Kumar
Mary Phuong
Victoria Krakovna
J. Uesato
Zachary Kenton
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals"
19 / 19 papers shown
Title
An alignment safety case sketch based on debate
Marie Davidsen Buhl
Jacob Pfau
Benjamin Hilton
Geoffrey Irving
38
0
0
06 May 2025
Estimating the Probabilities of Rare Outputs in Language Models
Gabriel Wu
Jacob Hilton
AAML
UQCV
48
2
0
17 Oct 2024
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan
Philip H. S. Torr
Austin Meek
Ashkan Khakzar
David M. Krueger
Fazl Barez
43
10
0
09 Oct 2024
Towards shutdownable agents via stochastic choice
Elliott Thornley
Alexander Roman
Christos Ziakas
Leyton Ho
Louis Thomson
38
0
0
30 Jun 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
47
23
0
11 Jun 2024
Open-Endedness is Essential for Artificial Superhuman Intelligence
Edward Hughes
Michael Dennis
Jack Parker-Holder
Feryal M. P. Behbahani
Aditi Mavalankar
Yuge Shi
Tom Schaul
Tim Rocktaschel
LRM
37
21
0
06 Jun 2024
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
Rahul Ramesh
Ekdeep Singh Lubana
Mikail Khona
Robert P. Dick
Hidenori Tanaka
CoGe
36
6
0
21 Nov 2023
A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking
Rose Hadshar
26
6
0
27 Oct 2023
Language Reward Modulation for Pretraining Reinforcement Learning
Ademi Adeniji
Amber Xie
Carmelo Sferrazza
Younggyo Seo
Stephen James
Pieter Abbeel
39
26
0
23 Aug 2023
Concept Extrapolation: A Conceptual Primer
Matija Franklin
Rebecca Gorman
Hal Ashton
Stuart Armstrong
10
1
0
19 Jun 2023
Model evaluation for extreme risks
Toby Shevlane
Sebastian Farquhar
Ben Garfinkel
Mary Phuong
Jess Whittlestone
...
Vijay Bolina
Jack Clark
Yoshua Bengio
Paul Christiano
Allan Dafoe
ELM
37
152
0
24 May 2023
Power-seeking can be probable and predictive for trained agents
Victoria Krakovna
János Kramár
TDI
27
16
0
13 Apr 2023
Adversarial Cheap Talk
Chris Xiaoxuan Lu
Timon Willi
Alistair Letcher
Jakob N. Foerster
AAML
24
17
0
20 Nov 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
59
183
0
30 Aug 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
319
11,953
0
04 Mar 2022
Goal Misgeneralization in Deep Reinforcement Learning
L. Langosco
Jack Koch
Lee D. Sharkey
J. Pfau
Laurent Orseau
David M. Krueger
24
78
0
28 May 2021
AI safety via debate
G. Irving
Paul Christiano
Dario Amodei
204
200
0
02 May 2018
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Balaji Lakshminarayanan
Alexander Pritzel
Charles Blundell
UQCV
BDL
276
5,661
0
05 Dec 2016
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar
Dheevatsa Mudigere
J. Nocedal
M. Smelyanskiy
P. T. P. Tang
ODL
299
2,890
0
15 Sep 2016
1