arXiv:2508.17511
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs
24 August 2025
Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans
Papers citing "School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs" (2 papers shown)
LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
Xuhao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao
09 Oct 2025
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
05 Oct 2025