Mitigating Deceptive Alignment via Self-Monitoring

Abstract

Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment: situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question with CoT Monitor+, the first framework that embeds a Self-Monitor inside the CoT process itself. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment-faking, sycophancy, and related deceptive behaviors. We evaluate various LLMs and show that unrestricted CoT aggravates the deceptive tendency. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at this http URL.
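The abstract describes folding the self-evaluation signal into reinforcement learning as an auxiliary reward. The paper's own formulation is not given here; below is a minimal Python sketch under stated assumptions: a task_reward for the final answer, a monitor_score in [0, 1] where higher means the CoT looks more misaligned, and a weight lam. All of these names and the linear penalty form are hypothetical, not the authors' exact method.

def combined_reward(task_reward: float, monitor_score: float, lam: float = 0.5) -> float:
    """Reward honest reasoning: keep the task reward, subtract a penalty
    proportional to how deceptive the self-monitor judged the CoT to be."""
    return task_reward - lam * monitor_score

# Example: a correct answer (task_reward=1.0) whose reasoning the self-monitor
# flags as mildly deceptive (monitor_score=0.3) nets 1.0 - 0.5 * 0.3 = 0.85.
print(combined_reward(1.0, 0.3))  # 0.85

A linear penalty of this kind keeps the task reward dominant while still discouraging reasoning traces the self-monitor flags; the actual weighting and signal shape would follow the paper's training setup.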

@article{ji2025_2505.18807,
  title={Mitigating Deceptive Alignment via Self-Monitoring},
  author={Jiaming Ji and Wenqi Chen and Kaile Wang and Donghai Hong and Sitong Fang and Boyuan Chen and Jiayi Zhou and Juntao Dai and Sirui Han and Yike Guo and Yaodong Yang},
  journal={arXiv preprint arXiv:2505.18807},
  year={2025}
}