Do Unlearning Methods Remove Information from Language Model Weights?

11 October 2024

Papers citing "Do Unlearning Methods Remove Information from Language Model Weights?"

Layered Unlearning for Adversarial Relearning Timothy Qian Vinith Suriyakumar Ashia Wilson Dylan Hadfield-Menell MU 28 0 0 14 May 2025
Access Controls Will Solve the Dual-Use Dilemma Evžen Wybitul AAML 26 0 0 14 May 2025
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs Aashiq Muhamed Jacopo Bonato Mona Diab Virginia Smith MU 66 0 0 11 Apr 2025
Not All Data Are Unlearned Equally Aravind Krishnan Siva Reddy Marius Mosbach MU 166 1 0 07 Apr 2025
Exact Unlearning of Finetuning Data via Model Merging at Scale Kevin Kuo Amrith Rajagopal Setlur Kartik Srinivas Aditi Raghunathan Virginia Smith MoMe CLL MU 45 0 0 06 Apr 2025
A General Framework to Enhance Fine-tuning-based LLM Unlearning J. Ren Zhenwei Dai Xianfeng Tang Hui Liu Jingying Zeng ... R. Goutam Suhang Wang Yue Xing Qi He Hui Liu MU 163 1 0 25 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities Zora Che Stephen Casper Robert Kirk Anirudh Satheesh Stewart Slocum ... Zikui Cai Bilal Chughtai Y. Gal Furong Huang Dylan Hadfield-Menell MU AAML ELM 85 3 0 03 Feb 2025
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization Phillip Guo Aaquib Syed Abhay Sheshadri Aidan Ewart Gintare Karolina Dziugaite KELM MU 41 5 0 16 Oct 2024
An Adversarial Perspective on Machine Unlearning for AI Safety Jakub Łucki Boyi Wei Yangsibo Huang Peter Henderson F. Tramèr Javier Rando MU AAML 73 32 0 26 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 82 19 0 02 Jul 2024