Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization

Main: 5 pages
Appendix: 12 pages
Bibliography: 2 pages
Figures: 10
Tables: 2
Abstract

Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and we confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at unlearning dangerous capabilities. MUDMAN significantly outperforms the prior TAR method (by 40%), setting a new state-of-the-art for robust unlearning.
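To make Disruption Masking concrete, below is a minimal PyTorch sketch of a single masked, normalized unlearning step, based only on the abstract's description. The function name disruption_masked_step, the learning rate, and the use of a global L2 norm for normalization are illustrative assumptions, not the authors' implementation; MUDMAN additionally wraps such updates in a meta-learning loop, which is omitted here.

import torch

def disruption_masked_step(param, unlearn_grad, retain_grad, lr=1e-3, eps=1e-8):
    """One illustrative Disruption Masking update (a sketch, not the paper's code)."""
    # Normalize the unlearning gradient so its scale is comparable across
    # batches (a global L2 norm is an assumption; the paper may normalize differently).
    g_u = unlearn_grad / (unlearn_grad.norm() + eps)
    # Keep only components where the unlearning and retaining gradients agree
    # in sign: a descent step along these components also decreases the retain
    # loss to first order, so the applied update is non-disruptive.
    agree = torch.sign(g_u) == torch.sign(retain_grad)
    param.data -= lr * g_u * agree

# Toy usage on a single weight matrix:
w = torch.nn.Parameter(torch.randn(4, 4))
g_unlearn = torch.randn(4, 4)  # gradient of the forget-set loss w.r.t. w
g_retain = torch.randn(4, 4)   # gradient of the retain-set loss w.r.t. w
disruption_masked_step(w, g_unlearn, g_retain)

Masking by sign agreement, rather than, say, projecting out the disruptive component, keeps the update sparse and cheap: it needs only the two gradients already computed for the unlearning and retaining losses.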

View on arXiv: https://arxiv.org/abs/2506.12484
@article{sondej2025_2506.12484,
  title={Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization},
  author={Filip Sondej and Yushi Yang and Mikołaj Kniejski and Marcel Windys},
  journal={arXiv preprint arXiv:2506.12484},
  year={2025}
}