2312.01350
Cited By
Honesty Is the Best Policy: Defining and Mitigating AI Deception
3 December 2023
Francis Rhys Ward
Francesco Belardinelli
Francesca Toni
Tom Everitt
Papers citing
"Honesty Is the Best Policy: Defining and Mitigating AI Deception"
41 / 41 papers shown
Title
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
76
28
0
11 Jun 2024
Standards for Belief Representations in LLMs
Daniel A. Herrmann
B. Levinstein
68
10
0
31 May 2024
Robust agents learn causal world models
Jonathan G. Richens
Tom Everitt
OOD
137
43
0
16 Feb 2024
The Reasons that Agents Act: Intention and Instrumental Goals
Francis Rhys Ward
Matt MacDermott
Francesco Belardinelli
Francesca Toni
Tom Everitt
AI4CE
50
13
0
11 Feb 2024
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Lorenzo Pacchiardi
A. J. Chan
Sören Mindermann
Ilan Moscovitz
Alexa Y. Pan
Y. Gal
Owain Evans
J. Brauner
LLMAG
HILM
58
52
0
26 Sep 2023
Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund
Asa Cooper Stickland
Mikita Balesni
Max Kaufmann
Meg Tong
Tomasz Korbak
Daniel Kokotajlo
Owain Evans
LLMAG
LRM
71
67
0
01 Sep 2023
Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks
B. Levinstein
Daniel A. Herrmann
57
61
0
30 Jun 2023
An Overview of Catastrophic AI Risks
Dan Hendrycks
Mantas Mazeika
Thomas Woodside
SILM
52
179
0
21 Jun 2023
Reasoning about Causality in Games
Lewis Hammond
James Fox
Tom Everitt
Ryan Carey
Alessandro Abate
Michael Wooldridge
LRM
AI4CE
30
16
0
05 Jan 2023
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez
Sam Ringer
Kamilė Lukošiūtė
Karina Nguyen
Edwin Chen
...
Danny Hernandez
Deep Ganguli
Evan Hubinger
Nicholas Schiefer
Jared Kaplan
ALM
52
398
0
19 Dec 2022
Talking About Large Language Models
Murray Shanahan
AI4CE
89
266
0
07 Dec 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova DasSarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
314
514
0
24 Sep 2022
Discovering Agents
Zachary Kenton
Ramana Kumar
Sebastian Farquhar
Jonathan G. Richens
Matt MacDermott
Tom Everitt
CML
71
31
0
17 Aug 2022
Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning
Julien Perolat
Bart De Vylder
Daniel Hennes
Eugene Tarassov
Florian Strub
...
Rémi Munos
David Silver
Satinder Singh
Demis Hassabis
K. Tuyls
75
192
0
30 Jun 2022
Is Power-Seeking AI an Existential Risk?
Joseph Carlsmith
ELM
58
87
0
16 Jun 2022
Path-Specific Objectives for Safer Agent Incentives
Sebastian Farquhar
Ryan Carey
Tom Everitt
46
27
0
21 Apr 2022
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
194
1,944
0
29 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
871
12,916
0
04 Mar 2022
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Shaden Smith
M. Patwary
Brandon Norick
P. LeGresley
Samyam Rajbhandari
...
Mohammad Shoeybi
Yuxiong He
Michael Houston
Saurabh Tiwary
Bryan Catanzaro
MoE
145
740
0
28 Jan 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
799
9,351
0
28 Jan 2022
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W. Rae
Sebastian Borgeaud
Trevor Cai
Katie Millican
Jordan Hoffmann
...
Jeff Stanway
L. Bennett
Demis Hassabis
Koray Kavukcuoglu
G. Irving
122
1,311
0
08 Dec 2021
Truthful AI: Developing and governing AI that does not lie
Owain Evans
Owen Cotton-Barratt
Lukas Finnveden
Adam Bales
Avital Balwit
Peter Wills
Luca Righetti
William Saunders
HILM
283
116
0
13 Oct 2021
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie C. Lin
Jacob Hilton
Owain Evans
HILM
137
1,897
0
08 Sep 2021
Definitions of intent suitable for algorithms
Hal Ashton
43
18
0
08 Jun 2021
Extending counterfactual accounts of intent to include oblique intent
Hal Ashton
136
3
0
07 Jun 2021
Alignment of Language Agents
Zachary Kenton
Tom Everitt
Laura Weidinger
Iason Gabriel
Vladimir Mikulik
G. Irving
70
165
0
26 Mar 2021
Equilibrium Refinements for Multi-Agent Influence Diagrams: Theory and Practice
Lewis Hammond
James Fox
Tom Everitt
Alessandro Abate
Michael Wooldridge
45
10
0
09 Feb 2021
Agent Incentives: A Causal Perspective
Tom Everitt
Ryan Carey
Eric D. Langlois
Pedro A. Ortega
Shane Legg
CML
50
54
0
02 Feb 2021
Open Problems in Cooperative AI
Allan Dafoe
Edward Hughes
Yoram Bachrach
Tantum Collins
Kevin R. McKee
Joel Z Leibo
Kate Larson
T. Graepel
98
202
0
15 Dec 2020
Studying Dishonest Intentions in Brazilian Portuguese Texts
F. Vargas
T. Pardo
11
2
0
13 Aug 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
743
41,932
0
28 May 2020
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
460
1,727
0
18 Sep 2019
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective
Tom Everitt
Marcus Hutter
Ramana Kumar
Victoria Krakovna
61
95
0
13 Aug 2019
Risks from Learned Optimization in Advanced Machine Learning Systems
Evan Hubinger
Chris van Merwijk
Vladimir Mikulik
Joar Skalse
Scott Garrabrant
78
151
0
05 Jun 2019
Do GANs leave artificial fingerprints?
Francesco Marra
Diego Gragnaniello
L. Verdoliva
Giovanni Poggi
GAN
55
322
0
31 Dec 2018
Towards Formal Definitions of Blameworthiness, Intention, and Moral Responsibility
Joseph Y. Halpern
Max Kleiman-Weiner
XAI
32
81
0
13 Oct 2018
Towards Deep Learning Models Resistant to Adversarial Attacks
Aleksander Madry
Aleksandar Makelov
Ludwig Schmidt
Dimitris Tsipras
Adrian Vladu
SILM
OOD
304
12,063
0
19 Jun 2017
Deal or No Deal? End-to-End Learning for Negotiation Dialogues
M. Lewis
Denis Yarats
Yann N. Dauphin
Devi Parikh
Dhruv Batra
LLMAG
80
414
0
16 Jun 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
687
131,526
0
12 Jun 2017
Certified Defenses for Data Poisoning Attacks
Jacob Steinhardt
Pang Wei Koh
Percy Liang
AAML
86
754
0
09 Jun 2017
The AGI Containment Problem
James Babcock
János Kramár
Roman V. Yampolskiy
61
276
0
02 Apr 2016