2506.19823
Cited By
v1
v2 (latest)
Persona Features Control Emergent Misalignment
24 June 2025
Miles Wang
Tom Dupré la Tour
Olivia Watkins
Alex Makelov
Ryan A. Chi
Samuel Miserendino
Jeffrey Wang
Achyuta Rajaram
Johannes Heidecke
Tejal Patwardhan
Dan Mossing
Github (39★)
Papers citing "Persona Features Control Emergent Misalignment"
12 / 12 papers shown
Title
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
Zheng-Xin Yong
Stephen H. Bach
LRM
192
0
0
23 Oct 2025
Detecting Adversarial Fine-tuning with Auditing Agents
Sarah Egler
John Schulman
Nicholas Carlini
AAML
MLAU
137
0
0
17 Oct 2025
AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?
Leonard Dung
Florian Mai
72
0
0
13 Oct 2025
LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
Xuhao Hu
Peng Wang
Xiaoya Lu
Dongrui Liu
Xuanjing Huang
Jing Shao
100
1
0
09 Oct 2025
The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
Pengrui Han
Rafal Kocielnik
Peiyang Song
Ramit Debnath
Dean Mobbs
Anima Anandkumar
R. Alvarez
269
4
0
03 Sep 2025
When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
Hanqi Yan
Hainiu Xu
Siya Qi
Shu Yang
Yulan He
LRM
145
2
0
30 Aug 2025
Decomposing Behavioral Phase Transitions in LLMs: Order Parameters for Emergent Misalignment
Julian Arnold
Niels Lörch
82
1
0
27 Aug 2025
Jinx: Unlimited LLMs for Probing Alignment Failures
Jiahao Zhao
Liwei Dong
84
0
0
11 Aug 2025
Training language models to be warm and empathetic makes them less reliable and more sycophantic
Lujain Ibrahim
Franziska Sofia Hafner
Luc Rocher
139
7
0
29 Jul 2025
The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models
Xingcheng Xu
156
0
0
27 Jul 2025
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
Alex Cloud
Minh Le
James Chua
Jan Betley
Anna Sztyber-Betley
Jacob Hilton
Samuel Marks
Owain Evans
141
22
0
20 Jul 2025
Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Yik Siu Chan
Zheng-Xin Yong
Stephen H. Bach
LRM
128
7
0
16 Jul 2025