arXiv:2311.08379
Scheming AIs: Will AIs fake alignment during training in order to get power?
14 November 2023
Joe Carlsmith
Papers citing "Scheming AIs: Will AIs fake alignment during training in order to get power?"
18 / 18 papers shown

Ctrl-Z: Controlling AI Agents via Resampling
Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, B. S.
AAML · 30 · 1 · 0 · 14 Apr 2025

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving
ELM · 27 · 1 · 0 · 07 Apr 2025

Safety Guardrails for LLM-Enabled Robots
Zachary Ravichandran, Alexander Robey, Vijay R. Kumar, George Pappas, Hamed Hassani
61 · 2 · 0 · 10 Mar 2025

A sketch of an AI control safety case
Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving
80 · 5 · 0 · 28 Jan 2025

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models
Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Jacob Haimes, Felix Hofstätter, Teun van der Weij
82 · 1 · 0 · 02 Dec 2024

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, ..., Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
AAML · 81 · 8 · 0 · 26 Nov 2024

The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C
Mikita Balesni, Tomek Korbak, Owain Evans
ReLM · LRM · 81 · 0 · 0 · 25 Nov 2024

Safety case template for frontier AI: A cyber inability argument
Arthur Goemans, Marie Davidsen Buhl, Jonas Schuett, Tomek Korbak, Jessica Wang, Benjamin Hilton, Geoffrey Irving
58 · 15 · 0 · 12 Nov 2024

Towards evaluations-based safety cases for AI scheming
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, ..., Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
ELM · 47 · 9 · 0 · 29 Oct 2024

Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
KELM · AIFin · LRM · 35 · 12 · 0 · 17 Oct 2024

On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, ..., Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
ELM · 43 · 29 · 0 · 05 Jul 2024

Identifying the Source of Generation for Large Language Models
Bumjin Park, Jaesik Choi
29 · 0 · 0 · 05 Jul 2024

AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM · 47 · 23 · 0 · 11 Jun 2024

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Joshua Clymer, Caden Juang, Severin Field
CVBM · 34 · 1 · 0 · 08 May 2024

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi, Evan Hubinger
34 · 12 · 0 · 25 Apr 2024

Universal Neurons in GPT2 Language Models
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
MILM · 102 · 37 · 0 · 22 Jan 2024

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, ..., Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
ELM · 50 · 258 · 0 · 14 Dec 2023

AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
31 · 40 · 0 · 12 Dec 2023