Honesty Is the Best Policy: Defining and Mitigating AI Deception

3 December 2023

Francis Rhys Ward

Francesco Belardinelli

Francesca Toni

Tom Everitt

Papers citing "Honesty Is the Best Policy: Defining and Mitigating AI Deception"

AI Sandbagging: Language Models can Strategically Underperform on Evaluations Teun van der Weij Felix Hofstätter Ollie Jaffe Samuel F. Brown Francis Rhys Ward ELM 76 28 0 11 Jun 2024
Standards for Belief Representations in LLMs Daniel A. Herrmann B. Levinstein 68 10 0 31 May 2024
Robust agents learn causal world models Jonathan G. Richens Tom Everitt OOD 137 43 0 16 Feb 2024
The Reasons that Agents Act: Intention and Instrumental Goals Francis Rhys Ward Matt MacDermott Francesco Belardinelli Francesca Toni Tom Everitt AI4CE 50 13 0 11 Feb 2024
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions Lorenzo Pacchiardi A. J. Chan Sören Mindermann Ilan Moscovitz Alexa Y. Pan Y. Gal Owain Evans J. Brauner LLMAG HILM 58 52 0 26 Sep 2023
Taken out of context: On measuring situational awareness in LLMs Lukas Berglund Asa Cooper Stickland Mikita Balesni Max Kaufmann Meg Tong Tomasz Korbak Daniel Kokotajlo Owain Evans LLMAG LRM 71 67 0 01 Sep 2023
Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks B. Levinstein Daniel A. Herrmann 57 61 0 30 Jun 2023
An Overview of Catastrophic AI Risks Dan Hendrycks Mantas Mazeika Thomas Woodside SILM 52 179 0 21 Jun 2023
Reasoning about Causality in Games Lewis Hammond James Fox Tom Everitt Ryan Carey Alessandro Abate Michael Wooldridge LRM AI4CE 30 16 0 05 Jan 2023
Discovering Language Model Behaviors with Model-Written Evaluations Ethan Perez Sam Ringer Kamilė Lukošiūtė Karina Nguyen Edwin Chen ... Danny Hernandez Deep Ganguli Evan Hubinger Nicholas Schiefer Jared Kaplan ALM 52 398 0 19 Dec 2022
Talking About Large Language Models Murray Shanahan AI4CE 89 266 0 07 Dec 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 314 514 0 24 Sep 2022
Discovering Agents Zachary Kenton Ramana Kumar Sebastian Farquhar Jonathan G. Richens Matt MacDermott Tom Everitt CML 71 31 0 17 Aug 2022
Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning Julien Perolat Bart De Vylder Daniel Hennes Eugene Tarassov Florian Strub ... Rémi Munos David Silver Satinder Singh Demis Hassabis K. Tuyls 75 192 0 30 Jun 2022
Is Power-Seeking AI an Existential Risk? Joseph Carlsmith ELM 58 87 0 16 Jun 2022
Path-Specific Objectives for Safer Agent Incentives Sebastian Farquhar Ryan Carey Tom Everitt 46 27 0 21 Apr 2022
Training Compute-Optimal Large Language Models Jordan Hoffmann Sebastian Borgeaud A. Mensch Elena Buchatskaya Trevor Cai ... Karen Simonyan Erich Elsen Jack W. Rae Oriol Vinyals Laurent Sifre AI4TS 194 1,944 0 29 Mar 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 871 12,916 0 04 Mar 2022
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model Shaden Smith M. Patwary Brandon Norick P. LeGresley Samyam Rajbhandari ... Mohammad Shoeybi Yuxiong He Michael Houston Saurabh Tiwary Bryan Catanzaro MoE 145 740 0 28 Jan 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason W. Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter F. Xia Ed H. Chi Quoc Le Denny Zhou LM&Ro LRM AI4CE ReLM 799 9,351 0 28 Jan 2022
Scaling Language Models: Methods, Analysis & Insights from Training Gopher Jack W. Rae Sebastian Borgeaud Trevor Cai Katie Millican Jordan Hoffmann ... Jeff Stanway L. Bennett Demis Hassabis Koray Kavukcuoglu G. Irving 122 1,311 0 08 Dec 2021
Truthful AI: Developing and governing AI that does not lie Owain Evans Owen Cotton-Barratt Lukas Finnveden Adam Bales Avital Balwit Peter Wills Luca Righetti William Saunders HILM 283 116 0 13 Oct 2021
TruthfulQA: Measuring How Models Mimic Human Falsehoods Stephanie C. Lin Jacob Hilton Owain Evans HILM 137 1,897 0 08 Sep 2021
Definitions of intent suitable for algorithms Hal Ashton 43 18 0 08 Jun 2021
Extending counterfactual accounts of intent to include oblique intent Hal Ashton 136 3 0 07 Jun 2021
Alignment of Language Agents Zachary Kenton Tom Everitt Laura Weidinger Iason Gabriel Vladimir Mikulik G. Irving 70 165 0 26 Mar 2021
Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice Lewis Hammond James Fox Tom Everitt Alessandro Abate Michael Wooldridge 45 10 0 09 Feb 2021
Agent Incentives: A Causal Perspective Tom Everitt Ryan Carey Eric D. Langlois Pedro A. Ortega Shane Legg CML 50 54 0 02 Feb 2021
Open Problems in Cooperative AI Allan Dafoe Edward Hughes Yoram Bachrach Tantum Collins Kevin R. McKee Joel Z Leibo Kate Larson T. Graepel 98 202 0 15 Dec 2020
Studying Dishonest Intentions in Brazilian Portuguese Texts F. Vargas T. Pardo 11 2 0 13 Aug 2020
Language Models are Few-Shot Learners Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan ... Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei BDL 743 41,932 0 28 May 2020
Fine-Tuning Language Models from Human Preferences Daniel M. Ziegler Nisan Stiennon Jeff Wu Tom B. Brown Alec Radford Dario Amodei Paul Christiano G. Irving ALM 460 1,727 0 18 Sep 2019
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective Tom Everitt Marcus Hutter Ramana Kumar Victoria Krakovna 61 95 0 13 Aug 2019
Risks from Learned Optimization in Advanced Machine Learning Systems Evan Hubinger Chris van Merwijk Vladimir Mikulik Joar Skalse Scott Garrabrant 78 151 0 05 Jun 2019
Do GANs leave artificial fingerprints? Francesco Marra Diego Gragnaniello L. Verdoliva Giovanni Poggi GAN 55 322 0 31 Dec 2018
Towards Formal Definitions of Blameworthiness, Intention, and Moral Responsibility Joseph Y. Halpern Max Kleiman-Weiner XAI 32 81 0 13 Oct 2018
Towards Deep Learning Models Resistant to Adversarial Attacks Aleksander Madry Aleksandar Makelov Ludwig Schmidt Dimitris Tsipras Adrian Vladu SILM OOD 304 12,063 0 19 Jun 2017
Deal or No Deal? End-to-End Learning for Negotiation Dialogues M. Lewis Denis Yarats Yann N. Dauphin Devi Parikh Dhruv Batra LLMAG 80 414 0 16 Jun 2017
Attention Is All You Need Ashish Vaswani Noam M. Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan Gomez Lukasz Kaiser Illia Polosukhin 3DV 687 131,526 0 12 Jun 2017
Certified Defenses for Data Poisoning Attacks Jacob Steinhardt Pang Wei Koh Percy Liang AAML 86 754 0 09 Jun 2017
The AGI Containment Problem James Babcock János Kramár Roman V. Yampolskiy 61 276 0 02 Apr 2016