v3 (latest)

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Manish Bhatt
Sarthak Munshi
Vineeth Sai Narajala
Idan Habler
Ammar Al-Kahfah
Ken Huang
Joel Webb
Blake Gatto
Md Tamjidul Hoque
Main: 13 pages · 4 figures · 5 tables · Bibliography: 3 pages · Appendix: 3 pages
Abstract

We prove that no continuous, utility-preserving wrapper defense (a function D : X → X that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with a connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an ε-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). Together these constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend the analysis to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
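The incompatibility at the heart of the trilemma can be sketched as a Lean 4 statement. This is a hypothetical schematic under assumed definitions (the names `Defense`, `utilityPreserving`, `complete`, and `trilemma_sketch` are invented here, and the proof is elided); it is not the paper's actual Lean development:

```lean
import Mathlib

-- Schematic only: illustrative names and hypotheses, not the paper's formalization.
-- A wrapper defense preprocesses prompts before the model sees them.
structure Defense (X : Type) where
  apply : X → X

-- Utility preservation: strictly safe inputs (safety score s below threshold τ)
-- must pass through the defense unchanged.
def utilityPreserving {X : Type} (s : X → ℝ) (τ : ℝ) (D : Defense X) : Prop :=
  ∀ x, s x < τ → D.apply x = x

-- Completeness: every defended output is strictly safe.
def complete {X : Type} (s : X → ℝ) (τ : ℝ) (D : Defense X) : Prop :=
  ∀ x, s (D.apply x) < τ

-- Trilemma (schematic): on a connected prompt space containing both a safe and
-- an unsafe prompt, a continuous, utility-preserving defense cannot be complete.
-- Intuition: the open safe set {x | s x < τ} is nonempty and proper, so by
-- connectedness it has a boundary point z with s z = τ; continuity forces
-- D.apply z = z, so s (D.apply z) = τ, contradicting strict completeness.
theorem trilemma_sketch {X : Type} [TopologicalSpace X] [ConnectedSpace X]
    (s : X → ℝ) (hs : Continuous s) (τ : ℝ) (D : Defense X)
    (hD : Continuous D.apply)
    (hsafe : ∃ x, s x < τ) (hunsafe : ∃ x, τ ≤ s x)
    (hu : utilityPreserving s τ D) : ¬ complete s τ D := by
  sorry -- boundary-fixation argument as outlined above
```

The `sorry` marks the elided boundary-fixation argument; the point of the sketch is only to make the three incompatible properties (continuity, utility preservation, completeness) explicit as formal predicates.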
