Internal Safety Collapse in Frontier Large Language Models

4 March 2026

Yutao Wu

Xiao Liu

Yifeng Gao

Xiang Zheng

Hanxun Huang

Yige Li

Cong Wang

Bo Li

Xingjun Ma

Yu-Gang Jiang

Main: 14 pages; Bibliography: 7 pages; Appendix: 30 pages; 12 figures; 38 tables

Abstract

This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench, which contains 53 scenarios across 8 professional disciplines. When evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding those of standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability, even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: this https URL
