Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization

Main: 5 pages
Appendix: 12 pages
Bibliography: 2 pages
Figures: 10
Tables: 2
Abstract

Language models can retain dangerous knowledge and skills even after extensive safety fine-tuning, posing both misuse and misalignment risks. Recent studies show that even specialized unlearning methods can be easily reversed. To address this, we systematically evaluate many existing and novel components of unlearning methods and identify ones crucial for irreversible unlearning. We introduce Disruption Masking, a technique in which we only allow updating weights where the signs of the unlearning gradient and the retaining gradient are the same. This ensures all updates are non-disruptive. Additionally, we identify the need for normalizing the unlearning gradients, and we confirm the usefulness of meta-learning. We combine these insights into MUDMAN (Meta-Unlearning with Disruption Masking and Normalization) and validate its effectiveness at unlearning dangerous capabilities. MUDMAN significantly outperforms the prior TAR method (by 40%), setting a new state-of-the-art for robust unlearning.
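To make Disruption Masking concrete, below is a minimal PyTorch sketch of a single masked, normalized unlearning step, based only on the abstract's description. The function name disruption_masked_step, the learning rate, and the use of a global L2 norm for normalization are illustrative assumptions, not the authors' implementation; MUDMAN additionally wraps such updates in a meta-learning loop, which is omitted here.

import torch

def disruption_masked_step(param, unlearn_grad, retain_grad, lr=1e-3, eps=1e-8):
    """One illustrative Disruption Masking update (a sketch, not the paper's code)."""
    # Normalize the unlearning gradient so its scale is comparable across
    # batches (a global L2 norm is an assumption; the paper may normalize differently).
    g_u = unlearn_grad / (unlearn_grad.norm() + eps)
    # Keep only components where the unlearning and retaining gradients agree
    # in sign: a descent step along these components also decreases the retain
    # loss to first order, so the applied update is non-disruptive.
    agree = torch.sign(g_u) == torch.sign(retain_grad)
    param.data -= lr * g_u * agree

# Toy usage on a single weight matrix:
w = torch.nn.Parameter(torch.randn(4, 4))
g_unlearn = torch.randn(4, 4)  # gradient of the forget-set loss w.r.t. w
g_retain = torch.randn(4, 4)   # gradient of the retain-set loss w.r.t. w
disruption_masked_step(w, g_unlearn, g_retain)

Masking by sign agreement, rather than, say, projecting out the disruptive component, keeps the update sparse and cheap: it needs only the two gradients already computed for the unlearning and retaining losses.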

View on arXiv: https://arxiv.org/abs/2506.12484
@article{sondej2025_2506.12484,
  title={Robust LLM Unlearning with MUDMAN: Meta-Unlearning with Disruption Masking And Normalization},
  author={Filip Sondej and Yushi Yang and Mikołaj Kniejski and Marcel Windys},
  journal={arXiv preprint arXiv:2506.12484},
  year={2025}
}