arXiv:2508.17511
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
24 August 2025
Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans
Papers citing "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs" (2 papers shown)
LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
Xuhao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao
09 Oct 2025
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
05 Oct 2025