CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

29 May 2025

Main: 9 pages

10 figures

Bibliography: 3 pages

6 tables

Appendix: 10 pages

Abstract

As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that CoT monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. However, CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores both reasoning and final outputs and combines them using a weighted average. This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
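The hybrid protocol described in the abstract can be illustrated with a minimal sketch. It assumes each monitor emits a suspicion score in [0, 1]. The names `MonitorScores`, `hybrid_score`, `is_flagged`, and the parameters `cot_weight` and `threshold` are hypothetical, and the default values are placeholders: the abstract does not state the weights or the decision threshold the authors use.

```python
from dataclasses import dataclass


@dataclass
class MonitorScores:
    """Suspicion scores for one transcript, each assumed to lie in [0, 1]."""
    cot: float     # score assigned to the reasoning trace (CoT monitor)
    action: float  # score assigned to the final output only (action-only monitor)


def hybrid_score(scores: MonitorScores, cot_weight: float = 0.5) -> float:
    """Combine the two independently produced scores with a weighted average.

    cot_weight is a hypothetical parameter; 0.5 is a placeholder, not a
    value taken from the paper.
    """
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must be in [0, 1]")
    return cot_weight * scores.cot + (1.0 - cot_weight) * scores.action


def is_flagged(scores: MonitorScores, cot_weight: float = 0.5,
               threshold: float = 0.5) -> bool:
    """Flag a transcript when the combined score reaches the (assumed) threshold."""
    return hybrid_score(scores, cot_weight) >= threshold
```

Setting `cot_weight` to 1.0 reduces this to CoT-only monitoring and 0.0 to action-only monitoring, so the sketch covers all three monitors the abstract compares.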

@article{arnav2025_2505.23575,
  title={CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring},
  author={Benjamin Arnav and Pablo Bernabeu-Pérez and Nathan Helm-Burger and Tim Kostolansky and Hannes Whittingham and Mary Phuong},
  journal={arXiv preprint arXiv:2505.23575},
  year={2025}
}
