2505.22586
Cited By
Precise In-Parameter Concept Erasure in Large Language Models
28 May 2025
Yoav Gur-Arieh
Clara Suslik
Yihuai Hong
Fazl Barez
Mor Geva
KELM
MU
Papers citing
"Precise In-Parameter Concept Erasure in Large Language Models"
50 / 50 papers shown
Title
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
Aashiq Muhamed
Jacopo Bonato
Mona Diab
Virginia Smith
MU
97
4
0
11 Apr 2025
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
Yuheng Chen
Pengfei Cao
Kang Liu
Jun Zhao
66
1
0
18 Feb 2025
Open Problems in Machine Unlearning for AI Safety
Fazl Barez
Tingchen Fu
Ameya Prabhu
Stephen Casper
Amartya Sanyal
...
David M. Krueger
Sören Mindermann
José Hernandez-Orallo
Mor Geva
Y. Gal
MU
66
19
0
10 Jan 2025
Inferring Functionality of Attention Heads from their Parameters
Amit Elhelo
Mor Geva
100
3
0
16 Dec 2024
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Zhengfu He
Wentao Shu
Xuyang Ge
Lingjie Chen
Junxuan Wang
...
Qipeng Guo
Xuanjing Huang
Zuxuan Wu
Yu-Gang Jiang
Xipeng Qiu
74
23
0
27 Oct 2024
Applying sparse autoencoders to unlearn knowledge in language models
Eoin Farrell
Yeu-Tong Lau
Arthur Conmy
MU
69
20
0
25 Oct 2024
Do Unlearning Methods Remove Information from Language Model Weights?
Aghyad Deeb
Fabien Roger
AAML
MU
65
21
0
11 Oct 2024
Position: LLM Unlearning Benchmarks are Weak Measures of Progress
Pratiksha Thaker
Shengyuan Hu
Neil Kale
Yash Maurya
Zhiwei Steven Wu
Virginia Smith
MU
87
13
0
03 Oct 2024
Erasing Conceptual Knowledge from Language Models
Rohit Gandikota
Sheridan Feucht
Samuel Marks
David Bau
KELM
ELM
MU
75
8
0
03 Oct 2024
An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki
Boyi Wei
Yangsibo Huang
Peter Henderson
F. Tramèr
Javier Rando
MU
AAML
113
38
0
26 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
64
106
0
09 Aug 2024
Machine Unlearning in Generative AI: A Survey
Zheyuan Liu
Guangyao Dou
Zhaoxuan Tan
Yijun Tian
Meng Jiang
MU
77
16
0
30 Jul 2024
Speech Representation Analysis based on Inter- and Intra-Model Similarities
Yassine El Kheir
Ahmed M. Ali
Shammur A. Chowdhury
SSL
65
3
0
23 Jun 2024
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Yihuai Hong
Lei Yu
Shauli Ravfogel
Haiqin Yang
Mor Geva
KELM
MU
85
21
0
17 Jun 2024
Improving Alignment and Robustness with Circuit Breakers
Andy Zou
Long Phan
Justin Wang
Derek Duenas
Maxwell Lin
Maksym Andriushchenko
Rowan Wang
Zico Kolter
Matt Fredrikson
Dan Hendrycks
AAML
79
97
0
06 Jun 2024
Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
Ruiqi Zhang
Licong Lin
Yu Bai
Song Mei
MU
100
150
0
08 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
103
137
0
28 Mar 2024
Threats, Attacks, and Defenses in Machine Unlearning: A Survey
Ziyao Liu
Huanyi Ye
Chen Chen
Yongsen Zheng
K. Lam
AAML
MU
63
30
0
20 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
85
32
0
27 Feb 2024
WilKE: Wise-Layer Knowledge Editor for Lifelong Knowledge Editing
Chenhui Hu
Pengfei Cao
Yubo Chen
Kang Liu
Jun Zhao
KELM
CLL
37
29
0
16 Feb 2024
Large Language Models Relearn Removed Concepts
Michelle Lo
Shay B. Cohen
Fazl Barez
KELM
48
19
0
03 Jan 2024
Towards more Practical Threat Models in Artificial Intelligence Security
Kathrin Grosse
L. Bieringer
Tarek R. Besold
Alexandre Alahi
55
13
0
16 Nov 2023
Who's Harry Potter? Approximate Unlearning in LLMs
Ronen Eldan
M. Russinovich
MU
MoMe
125
191
0
03 Oct 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
75
382
0
15 Sep 2023
Neurons in Large Language Models: Dead, N-gram, Positional
Elena Voita
Javier Ferrando
Christoforos Nalmpantis
MILM
83
54
0
09 Sep 2023
LEACE: Perfect linear concept erasure in closed form
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Ryan Cotterell
Edward Raff
Stella Biderman
KELM
MU
60
107
0
06 Jun 2023
Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection
Shadi Iskander
Kira Radinsky
Yonatan Belinkov
98
18
0
17 May 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
180
203
0
02 May 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
235
297
0
28 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger
Zhengxuan Wu
Christopher Potts
Thomas Icard
Noah D. Goodman
CML
85
105
0
05 Mar 2023
Mass-Editing Memory in a Transformer
Kevin Meng
Arnab Sen Sharma
A. Andonian
Yonatan Belinkov
David Bau
KELM
VLM
99
543
0
13 Oct 2022
Analyzing Encoded Concepts in Transformer Language Models
Hassan Sajjad
Nadir Durrani
Fahim Dalvi
Firoj Alam
A. Khan
Jia Xu
37
44
0
27 Jun 2022
Discovering Latent Concepts Learned in BERT
Fahim Dalvi
A. Khan
Firoj Alam
Nadir Durrani
Jia Xu
Hassan Sajjad
SSL
41
58
0
15 May 2022
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
Mor Geva
Avi Caciularu
Ke Wang
Yoav Goldberg
KELM
87
358
0
28 Mar 2022
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
166
1,308
0
10 Feb 2022
Neuron-level Interpretation of Deep NLP Models: A Survey
Hassan Sajjad
Nadir Durrani
Fahim Dalvi
MILM
AI4CE
57
83
0
30 Aug 2021
Pay Attention to MLPs
Hanxiao Liu
Zihang Dai
David R. So
Quoc V. Le
AI4CE
92
657
0
17 May 2021
Transformer Feed-Forward Layers Are Key-Value Memories
Mor Geva
R. Schuster
Jonathan Berant
Omer Levy
KELM
115
792
0
29 Dec 2020
Measuring Massive Multitask Language Understanding
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
D. Song
Jacob Steinhardt
ELM
RALM
143
4,222
0
07 Sep 2020
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
533
41,106
0
28 May 2020
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
Shauli Ravfogel
Yanai Elazar
Hila Gonen
Michael Twiton
Yoav Goldberg
95
378
0
16 Apr 2020
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
Adam Roberts
Colin Raffel
Noam M. Shazeer
KELM
79
886
0
10 Feb 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
466
4,662
0
23 Jan 2020
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktäschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
543
2,639
0
03 Sep 2019
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers
Iryna Gurevych
806
11,979
0
27 Aug 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
430
24,160
0
26 Jul 2019
Making AI Forget You: Data Deletion in Machine Learning
Antonio A. Ginart
M. Guan
Gregory Valiant
James Zou
MU
62
467
0
11 Jul 2019
Know What You Don't Know: Unanswerable Questions for SQuAD
Pranav Rajpurkar
Robin Jia
Percy Liang
RALM
ELM
206
2,830
0
11 Jun 2018
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
Tolga Bolukbasi
Kai-Wei Chang
James Zou
Venkatesh Saligrama
Adam Kalai
CVBM
FaML
69
3,115
0
21 Jul 2016
Building high-level features using large scale unsupervised learning
Quoc V. Le
Marc'Aurelio Ranzato
R. Monga
M. Devin
Kai Chen
G. Corrado
J. Dean
A. Ng
SSL
OffRL
CVBM
104
2,268
0
29 Dec 2011