
Toward Secure Tuning: Mitigating Security Risks from Instruction Fine-Tuning

Abstract

Instruction fine-tuning has emerged as a critical technique for customizing Large Language Models (LLMs) to specific applications. However, recent studies have highlighted significant security vulnerabilities in fine-tuned LLMs. Existing defense efforts focus largely on pre-training and post-training methods, while in-training methods remain underexplored. To fill this gap, we introduce a novel secure-tuning strategy called SWAT. By analyzing how module-level parameters (e.g., Q/K/V/O) affect drift in the security feature space, we identify a robust subset of modules, termed Mods_Rob. Our SWAT strategy first warms up Mods_Rob to capture low-level features with minimal security risk, and then trains all parameters to achieve optimal task performance. Essentially, this strategy shifts the early learning burden from global parameters to Mods_Rob, reducing the update magnitudes of the non-robust subset. Across various datasets, scenarios, and LLMs, our strategy consistently mitigates security risks while preserving task performance. Importantly, it can be seamlessly integrated with pre-training and post-training methods, yielding further improvements.
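
For concreteness, here is a minimal sketch of the two-stage schedule the abstract describes, assuming a PyTorch/Hugging Face fine-tuning setup. The specific modules chosen for Mods_Rob, the model name, and the warm-up length are placeholders, since the abstract does not specify them; the actual subset comes from the paper's module-level drift analysis.

```python
from transformers import AutoModelForCausalLM

# Hypothetical choice of Mods_Rob: substitute the module subset
# identified by the security feature space drift analysis.
MODS_ROB_KEYWORDS = ("q_proj", "k_proj")

# Placeholder model; any causal LM fine-tuned with this schedule works.
model = AutoModelForCausalLM.from_pretrained("gpt2")

def set_trainable(model, train_all: bool) -> None:
    """Stage 1 (train_all=False): update only Mods_Rob parameters.
    Stage 2 (train_all=True): update all parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = train_all or any(
            key in name for key in MODS_ROB_KEYWORDS
        )

# Stage 1: warm up Mods_Rob so the early learning burden falls on the
# robust subset, limiting update magnitudes in non-robust modules.
set_trainable(model, train_all=False)
# ... run warm-up steps of the usual instruction fine-tuning loop ...

# Stage 2: unfreeze everything and fine-tune for task performance.
set_trainable(model, train_all=True)
# ... continue training to convergence ...
```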

@article{du2025_2410.04524,
  title={Toward Secure Tuning: Mitigating Security Risks from Instruction Fine-Tuning},
  author={Yanrui Du and Sendong Zhao and Jiawei Cao and Ming Ma and Danyang Zhao and Shuren Qi and Fenglei Fan and Ting Liu and Bing Qin},
  journal={arXiv preprint arXiv:2410.04524},
  year={2025}
}
