ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

21 April 2026

Kun Wang

Cheng Qian

Miao Yu

Lilan Peng

Liang Lin

Jiaming Zhang

Tianyu Zhang

Yu Cheng

Yang Wang

MLLM

AAML

Main:9 Pages

17 Figures

Bibliography:2 Pages

5 Tables

Appendix:7 Pages

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior work has demonstrated that backdoors can be implanted in MLLMs by poisoning fine-tuning data to manipulate inference, the underlying mechanisms of these attacks remain opaque, complicating both understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: Backdoor injection updates appear full-rank overall and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: Both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shift magnitude scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: this https URL
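
The two findings can be illustrated with a toy NumPy sketch. This is not the ProjLens code or the authors' experimental setup: it assumes a purely linear projector, and the names `make_toy_update`, `target`, and `probe` are hypothetical. The weight update is built as a rank-1 component plus small dense noise, so it is numerically full-rank while almost all of its energy sits in one direction, and the induced embedding shift grows linearly with the input norm.

```python
import numpy as np

def make_toy_update(dim_in=16, dim_out=8, noise_scale=1e-3, seed=0):
    """Build a hypothetical projector update: rank-1 signal plus small dense noise."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim_out)
    target /= np.linalg.norm(target)   # unit output direction (stand-in for the backdoor target)
    probe = rng.normal(size=dim_in)
    probe /= np.linalg.norm(probe)     # unit input direction the update responds to
    noise = noise_scale * rng.normal(size=(dim_out, dim_in))
    delta_w = np.outer(target, probe) + noise
    return delta_w, target, probe

delta_w, target, probe = make_toy_update()

# (1) Low-rank structure: the update is full-rank numerically,
# yet the top singular direction carries nearly all of its energy.
numerical_rank = np.linalg.matrix_rank(delta_w)
singular_values = np.linalg.svd(delta_w, compute_uv=False)
top_energy = singular_values[0] ** 2 / np.sum(singular_values ** 2)

# (2) Activation mechanism: for a linear map, the shift delta_w @ x
# points along the same direction and scales with the input norm.
shift_small = delta_w @ (1.0 * probe)
shift_large = delta_w @ (5.0 * probe)
norm_ratio = np.linalg.norm(shift_large) / np.linalg.norm(shift_small)
cosine_to_target = shift_small @ target / np.linalg.norm(shift_small)

print("numerical rank:", numerical_rank)
print("energy in top singular direction:", top_energy)
print("shift norm ratio for 5x larger input:", norm_ratio)
print("cosine between shift and target:", cosine_to_target)
```

In this sketch the linear scaling follows directly from the linearity assumption; projectors with nonlinear layers would need to be checked empirically.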
