Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behavior, resulting in either unjustified refusals of harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguarding LLMs by learning to adapt the system prompts of instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate how tailoring the system prompt to each specific user input affects the safety of the responses. To this end, we propose Sysformer, a transformer model that updates an initial system prompt to a more robust system prompt in the LLM's input embedding space while attending to the user prompt. Keeping the LLM parameters frozen, Sysformer is trained to refuse a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on LLMs from different families and on recent benchmarks, we demonstrate that Sysformer significantly enhances the robustness of LLMs, yielding substantial gains in the refusal rate on harmful prompts while also improving compliance with safe prompts. The results generalize well to sophisticated jailbreaking attacks, making LLMs markedly more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.
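The abstract outlines the full mechanism: a transformer maps the initial system prompt to an adapted one in the LLM's input embedding space while attending to the user prompt, and it is trained with the LLM frozen to refuse harmful prompts and comply with safe ones. The following minimal PyTorch sketch illustrates that setup. It is not the authors' implementation; the module sizes, the cross-attention design, the embed lookup (e.g., the LLM's own input embedding table), and the teacher-forced cross-entropy loss are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Sysformer(nn.Module):
    """Sketch: rewrite the initial system-prompt embeddings conditioned on
    the user prompt, entirely in the LLM's input embedding space. The
    depth, head count, and use of cross-attention here are assumptions."""

    def __init__(self, d_model: int, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        # Decoder layers let system-prompt tokens (tgt) attend to the
        # user prompt (memory) via cross-attention.
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, sys_emb: torch.Tensor, user_emb: torch.Tensor) -> torch.Tensor:
        # sys_emb:  (B, S, d)  embeddings of the initial system prompt
        # user_emb: (B, U, d)  embeddings of the user prompt
        return self.decoder(tgt=sys_emb, memory=user_emb)


def training_step(sysformer, frozen_lm, embed, opt, sys_ids, user_ids, target_ids):
    """One update step with a HuggingFace-style causal LM whose parameters
    have requires_grad=False: gradients flow through its activations back
    into the Sysformer only. Harmful prompts carry refusal targets and
    safe prompts carry reference answers, per the abstract; the exact
    loss form below is an assumption."""
    sys_emb, user_emb = embed(sys_ids), embed(user_ids)
    adapted = sysformer(sys_emb, user_emb)           # adapted system prompt
    tgt_emb = embed(target_ids[:, :-1])              # teacher forcing
    inputs = torch.cat([adapted, user_emb, tgt_emb], dim=1)
    logits = frozen_lm(inputs_embeds=inputs).logits
    # The last T positions predict the T target tokens autoregressively.
    resp_logits = logits[:, -target_ids.size(1):, :]
    loss = F.cross_entropy(resp_logits.transpose(1, 2), target_ids)
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()

In this reading, freezing the backbone (for p in frozen_lm.parameters(): p.requires_grad_(False)) is what keeps the defense cheap relative to fine-tuning: only the small Sysformer is optimized, while the LLM serves as a fixed, differentiable scorer of the adapted prompt.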
@article{sharma2025_2506.15751,
  title={Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts},
  author={Kartik Sharma and Yiqiao Jin and Vineeth Rakesh and Yingtong Dou and Menghai Pan and Mahashweta Das and Srijan Kumar},
  journal={arXiv preprint arXiv:2506.15751},
  year={2025}
}