Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

23 May 2025

Abstract

Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we address the question: does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Building on these theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.

@article{lu2025_2505.17650,
  title={Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?},
  author={Chengda Lu and Xiaoyu Fan and Yu Huang and Rongwu Xu and Jijie Li and Wei Xu},
  journal={arXiv preprint arXiv:2505.17650},
  year={2025}
}
