2401.05566
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
10 January 2024
Evan Hubinger
Carson E. Denison
Jesse Mu
Mike Lambert
Meg Tong
M. MacDiarmid
Tamera Lanham
Daniel M. Ziegler
Timothy Maxwell
Newton Cheng
Adam Jermyn
Amanda Askell
Ansh Radhakrishnan
Cem Anil
David Duvenaud
Deep Ganguli
Fazl Barez
Jack Clark
Kamal Ndousse
Kshitij Sachan
Michael Sellitto
Mrinank Sharma
Nova Dassarma
Roger C. Grosse
Shauna Kravec
Yuntao Bai
Zachary Witten
Marina Favaro
J. Brauner
Holden Karnofsky
Paul Christiano
Samuel R. Bowman
Logan Graham
Jared Kaplan
Sören Mindermann
Ryan Greenblatt
Buck Shlegeris
Nicholas Schiefer
Ethan Perez
LLMAG
Papers citing
"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training"
13 / 13 papers shown
Title
Security Concerns for Large Language Models: A Survey
Miles Q. Li
Benjamin C. M. Fung
PILM
ELM
69
0
0
24 May 2025
Discovering Forbidden Topics in Language Models
Can Rager
Chris Wendler
Rohit Gandikota
David Bau
31
0
0
23 May 2025
A Linear Approach to Data Poisoning
Diego Granziol
Donald Flynn
AAML
100
0
0
21 May 2025
Demonstrating specification gaming in reasoning models
Alexander Bondarenko
Denis Volk
Dmitrii Volkov
Jeffrey Ladish
LRM
LLMAG
55
5
0
18 Feb 2025
LLMScan: Causal Scan for LLM Misbehavior Detection
Mengdi Zhang
Kai Kiat Goh
Peixin Zhang
Jun Sun
Rose Lin Xin
Hongyu Zhang
105
0
0
22 Oct 2024
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan
Udari Madhushani Sehwag
Michael-Andrei Panaitescu-Liess
Furong Huang
SILM
AAML
69
0
0
15 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
124
1
0
09 Oct 2024
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
H. Zhang
Jingyuan Huang
Kai Mei
Yifei Yao
Zhenting Wang
Chenlu Zhan
Hongwei Wang
Yongfeng Zhang
AAML
LLMAG
ELM
73
30
0
03 Oct 2024
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
Yuetai Li
Zhangchen Xu
Fengqing Jiang
Luyao Niu
D. Sahabandu
Bhaskar Ramasubramanian
Radha Poovendran
SILM
AAML
76
7
0
18 Jun 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
65
27
0
11 Jun 2024
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
Xijia Tao
Shuai Zhong
Lei Li
Qi Liu
Lingpeng Kong
80
26
0
05 Mar 2024
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs
Aly M. Kassem
Omar Mahmoud
Niloofar Mireshghallah
Hyunwoo J. Kim
Yulia Tsvetkov
Yejin Choi
Sherif Saad
Santu Rana
77
20
0
05 Mar 2024
CroissantLLM: A Truly Bilingual French-English Language Model
Manuel Faysse
Patrick Fernandes
Nuno M. Guerreiro
António Loison
Duarte M. Alves
...
François Yvon
André F.T. Martins
Gautier Viaud
Céline Hudelot
Pierre Colombo
87
33
0
01 Feb 2024