
IF-GUIDE: Influence Function-Guided Detoxification of LLMs

Main: 9 pages
Appendix: 7 pages
Bibliography: 6 pages
9 figures, 7 tables
Abstract

We study how training data contributes to the emergence of toxic behaviors in large language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-Guide, which leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-Guide does not rely on human-preference data, which is typically required by existing alignment methods. In our evaluation, we demonstrate that IF-Guide substantially reduces both explicit and implicit toxicity, by up to 10× compared to uncensored models and up to 3× compared to baseline alignment methods such as DPO and RAD, across both pre-training and fine-tuning scenarios. IF-Guide is computationally efficient: a billion-parameter model is not necessary for computing influence scores; a million-parameter model, with 7.5× fewer parameters, can effectively serve as a proxy for identifying harmful data.
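
The abstract outlines the core mechanism: score each training token's contribution to model toxicity, then suppress the flagged tokens in the training objective. Below is a minimal, hypothetical PyTorch sketch of that idea, assuming per-token influence scores have already been computed; the function name, the threshold, and the unlearning-style penalty term are illustrative assumptions, not the paper's actual objective.

    import torch
    import torch.nn.functional as F

    def detox_lm_loss(logits, labels, influence_scores, threshold=0.0, lam=1.0):
        """Next-token loss that suppresses tokens flagged as toxic-influential.

        logits:           (batch, seq, vocab) model outputs
        labels:           (batch, seq) target token ids
        influence_scores: (batch, seq) precomputed per-token toxicity attributions
        """
        # Per-token negative log-likelihood, shape (batch, seq).
        nll = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")

        toxic = (influence_scores > threshold).float()
        clean = 1.0 - toxic

        # Learn normally from tokens not flagged as toxic-influential.
        clean_loss = (nll * clean).sum() / clean.sum().clamp(min=1.0)
        # Push down the likelihood of flagged tokens (unlearning-style penalty).
        toxic_loss = (-nll * toxic).sum() / toxic.sum().clamp(min=1.0)

        return clean_loss + lam * toxic_loss

This sketch only shows where token-level attributions could enter the loss; how the scores are computed and how the suppression term is formulated follow the paper's construction, not this simplified penalty.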

@article{coalson2025_2506.01790,
  title={IF-GUIDE: Influence Function-Guided Detoxification of LLMs},
  author={Zachary Coalson and Juhan Bae and Nicholas Carlini and Sanghyun Hong},
  journal={arXiv preprint arXiv:2506.01790},
  year={2025}
}