Goal Misgeneralization: Why Correct Specifications Aren't Enough For
Correct Goals

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

4 October 2022

Victoria Krakovna

Papers citing "Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals"

19 / 19 papers shown

Title
An alignment safety case sketch based on debate Marie Davidsen Buhl Jacob Pfau Benjamin Hilton Geoffrey Irving 38 0 0 06 May 2025
Estimating the Probabilities of Rare Outputs in Language Models Gabriel Wu Jacob Hilton AAML UQCV 48 2 0 17 Oct 2024
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models Michael Lan Philip H. S. Torr Austin Meek Ashkan Khakzar David M. Krueger Fazl Barez 43 10 0 09 Oct 2024
Towards shutdownable agents via stochastic choice Elliott Thornley Alexander Roman Christos Ziakas Leyton Ho Louis Thomson 38 0 0 30 Jun 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations Teun van der Weij Felix Hofstätter Ollie Jaffe Samuel F. Brown Francis Rhys Ward ELM 47 23 0 11 Jun 2024
Open-Endedness is Essential for Artificial Superhuman Intelligence Edward Hughes Michael Dennis Jack Parker-Holder Feryal M. P. Behbahani Aditi Mavalankar Yuge Shi Tom Schaul Tim Rocktaschel LRM 37 21 0 06 Jun 2024
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks Rahul Ramesh Ekdeep Singh Lubana Mikail Khona Robert P. Dick Hidenori Tanaka CoGe 36 6 0 21 Nov 2023
A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking Rose Hadshar 26 6 0 27 Oct 2023
Language Reward Modulation for Pretraining Reinforcement Learning Ademi Adeniji Amber Xie Carmelo Sferrazza Younggyo Seo Stephen James Pieter Abbeel 39 26 0 23 Aug 2023
Concept Extrapolation: A Conceptual Primer Matija Franklin Rebecca Gorman Hal Ashton Stuart Armstrong 10 1 0 19 Jun 2023
Model evaluation for extreme risks Toby Shevlane Sebastian Farquhar Ben Garfinkel Mary Phuong Jess Whittlestone ... Vijay Bolina Jack Clark Yoshua Bengio Paul Christiano Allan Dafoe ELM 37 152 0 24 May 2023
Power-seeking can be probable and predictive for trained agents Victoria Krakovna János Kramár TDI 27 16 0 13 Apr 2023
Adversarial Cheap Talk Chris Xiaoxuan Lu Timon Willi Alistair Letcher Jakob N. Foerster AAML 24 17 0 20 Nov 2022
The Alignment Problem from a Deep Learning Perspective Richard Ngo Lawrence Chan Sören Mindermann 59 183 0 30 Aug 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 319 11,953 0 04 Mar 2022
Goal Misgeneralization in Deep Reinforcement Learning L. Langosco Jack Koch Lee D. Sharkey J. Pfau Laurent Orseau David M. Krueger 24 78 0 28 May 2021
AI safety via debate G. Irving Paul Christiano Dario Amodei 204 200 0 02 May 2018
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles Balaji Lakshminarayanan Alexander Pritzel Charles Blundell UQCV BDL 276 5,661 0 05 Dec 2016
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima N. Keskar Dheevatsa Mudigere J. Nocedal M. Smelyanskiy P. T. P. Tang ODL 299 2,890 0 15 Sep 2016