2306.03819
LEACE: Perfect linear concept erasure in closed form
6 June 2023
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Ryan Cotterell
Edward Raff
Stella Biderman
KELM
MU
Papers citing "LEACE: Perfect linear concept erasure in closed form"
50 / 119 papers shown
Title
Precise In-Parameter Concept Erasure in Large Language Models
Yoav Gur-Arieh
Clara Suslik
Yihuai Hong
Fazl Barez
Mor Geva
KELM
MU
79
0
0
28 May 2025
Improved Representation Steering for Language Models
Zhengxuan Wu
Qinan Yu
Aryaman Arora
Christopher D. Manning
Christopher Potts
LLMSV
55
0
0
27 May 2025
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?
Hongzheng Yang
Yongqiang Chen
Zeyu Qin
Tongliang Liu
Chaowei Xiao
Kun Zhang
Bo Han
LLMSV
32
0
0
24 May 2025
Quiet Feature Learning in Algorithmic Tasks
Prudhviraj Naidu
Zixian Wang
Leon Bergen
R. Paturi
VLM
104
0
0
06 May 2025
DetoxAI: a Python Toolkit for Debiasing Deep Learning Models in Computer Vision
Ignacy Stepka
Lukasz Sztukiewicz
Michał Wiliński
Jerzy Stefanowski
46
0
0
02 May 2025
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Vaidehi Patil
Yi-Lin Sung
Peter Hase
Jie Peng
Jen-tse Huang
Joey Tianyi Zhou
AAML
MU
250
4
0
01 May 2025
Probing then Editing Response Personality of Large Language Models
Tianjie Ju
Zhenyu Shao
Binghai Wang
Yulin Chen
Zhuosheng Zhang
Hao Fei
Mong Li Lee
Wynne Hsu
Sufeng Duan
Gongshen Liu
KELM
110
2
0
14 Apr 2025
Fundamental Limits of Perfect Concept Erasure
Somnath Basu Roy Chowdhury
Avinava Dubey
Ahmad Beirami
Rahul Kidambi
Nicholas Monath
Amr Ahmed
Snigdha Chaturvedi
87
1
0
25 Mar 2025
Controlled Model Debiasing through Minimal and Interpretable Updates
Federico Di Gennaro
Thibault Laugel
Vincent Grari
Marcin Detyniecki
FaML
107
0
0
28 Feb 2025
Model Lakes
Koyena Pal
David Bau
Renée J. Miller
155
2
0
24 Feb 2025
Analyzing the Inner Workings of Transformers in Compositional Generalization
Ryoma Kumon
Hitomi Yanaka
87
0
0
24 Feb 2025
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
Tom Wollschlager
Jannes Elstner
Simon Geisler
Vincent Cohen-Addad
Stephan Günnemann
Johannes Gasteiger
LLMSV
94
6
0
24 Feb 2025
Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment
Pegah Khayatan
Mustafa Shukor
Jayneel Parekh
Matthieu Cord
LLMSV
96
1
0
06 Jan 2025
Representation in large language models
Cameron C. Yetman
82
1
0
03 Jan 2025
Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
Keltin Grimes
Marco Christiani
David Shriver
Marissa Connor
KELM
116
4
0
17 Dec 2024
Controllable Context Sensitivity and the Knob Behind It
Julian Minder
Kevin Du
Niklas Stoehr
Giovanni Monea
Chris Wendler
Robert West
Ryan Cotterell
KELM
114
6
0
11 Nov 2024
Focus On This, Not That! Steering LLMs with Adaptive Feature Specification
Tom A. Lamb
Adam Davies
Alasdair Paren
Philip Torr
Francesco Pinto
111
0
0
30 Oct 2024
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
Neale Ratzlaff
Matthew Lyle Olson
Musashi Hinck
Shao-Yen Tseng
Vasudev Lal
Phillip Howard
103
0
0
17 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
128
28
0
15 Oct 2024
LLM Unlearning via Loss Adjustment with Only Forget Data
Yaxuan Wang
Jiaheng Wei
Chris Yuhao Liu
Jinlong Pang
Qiang Liu
A. Shah
Yujia Bao
Yang Liu
Wei Wei
KELM
MU
143
19
0
14 Oct 2024
Robust AI-Generated Text Detection by Restricted Embeddings
Kristian Kuznetsov
Eduard Tulchinskii
Laida Kushnareva
German Magai
Serguei Barannikov
Sergey I. Nikolenko
Irina Piontkovskaya
DeLMO
60
3
0
10 Oct 2024
Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models
Vinith Suriyakumar
Rohan Alur
Ayush Sekhari
Manish Raghavan
Ashia Wilson
97
4
0
10 Oct 2024
OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
Yu-Shin Huang
Peter Just
Krishna Narayanan
Chao Tian
116
2
0
06 Oct 2024
Optimal ablation for interpretability
Maximilian Li
Lucas Janson
FAtt
91
3
0
16 Sep 2024
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
Róbert Csordás
Christopher Potts
Christopher D. Manning
Atticus Geiger
GAN
72
21
0
20 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
112
25
0
02 Aug 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
...
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML
MU
109
62
0
01 Aug 2024
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Weijia Shi
Jaechan Lee
Yangsibo Huang
Sadhika Malladi
Jieyu Zhao
Ari Holtzman
Daogao Liu
Luke Zettlemoyer
Noah A. Smith
Chiyuan Zhang
MU
ELM
81
83
0
08 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
Aaron Mueller
CML
56
10
0
05 Jul 2024
Machine Unlearning Fails to Remove Data Poisoning Attacks
Martin Pawelczyk
Jimmy Z. Di
Yiwei Lu
Gautam Kamath
Ayush Sekhari
Seth Neel
AAML
MU
128
17
0
25 Jun 2024
Towards a Science Exocortex
Kevin G. Yager
103
2
0
24 Jun 2024
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li
Zheng-Xin Yong
Stephen H. Bach
CLL
78
18
0
23 Jun 2024
Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models
Dohyun Lee
Daniel Rim
Minseok Choi
Jaegul Choo
PILM
MU
98
5
0
20 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
127
213
0
17 Jun 2024
In-Context Editing: Learning Knowledge from Self-Induced Distributions
Siyuan Qi
Bangcheng Yang
Kailin Jiang
Xiaobo Wang
Jiaqi Li
Yifan Zhong
Yaodong Yang
Zilong Zheng
KELM
155
10
0
17 Jun 2024
Exploring Safety-Utility Trade-Offs in Personalized Language Models
Anvesh Rao Vijjini
Somnath Basu Roy Chowdhury
Snigdha Chaturvedi
133
9
0
17 Jun 2024
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models
Zhuoran Jin
Pengfei Cao
Chenhao Wang
Zhitao He
Hongbang Yuan
Jiachun Li
Yubo Chen
Kang Liu
Jun Zhao
KELM
MU
118
25
0
16 Jun 2024
On the Encoding of Gender in Transformer-based ASR Representations
Aravind Krishnan
Badr M. Abdullah
Dietrich Klakow
65
4
0
14 Jun 2024
Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation
Bar Iluz
Yanai Elazar
Asaf Yehudai
Gabriel Stanovsky
49
2
0
02 Jun 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
125
15
0
26 May 2024
Linearly Controlled Language Generation with Performative Guarantees
Emily Cheng
Marco Baroni
Carmen Amo Alonso
92
3
0
24 May 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov
Georg Lange
Neel Nanda
62
41
0
14 May 2024
Automating Thematic Analysis: How LLMs Analyse Controversial Topics
Awais Hameed Khan
H. Kegalle
Rhea D'Silva
Ned Watt
Daniel Whelan-Shamy
Lida Ghahremanlou
Liam Magee
80
7
0
11 May 2024
Utility-Fairness Trade-Offs and How to Find Them
Sepehr Dehdashtian
Bashir Sadeghi
Vishnu Boddeti
65
6
0
15 Apr 2024
ReFT: Representation Finetuning for Language Models
Zhengxuan Wu
Aryaman Arora
Zheng Wang
Atticus Geiger
Daniel Jurafsky
Christopher D. Manning
Christopher Potts
OffRL
102
71
0
04 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
133
158
0
28 Mar 2024
Can Large Language Models (or Humans) Disentangle Text?
Nicolas Audinet de Pieuchon
Adel Daoud
Connor Jerzak
Moa Johansson
Richard Johansson
66
0
0
25 Mar 2024
What Happens to a Dataset Transformed by a Projection-based Concept Removal Method?
Richard Johansson
54
0
0
24 Mar 2024
Detoxifying Large Language Models via Knowledge Editing
Meng Wang
Ningyu Zhang
Ziwen Xu
Zekun Xi
Shumin Deng
Yunzhi Yao
Qishen Zhang
Linyi Yang
Jindong Wang
Huajun Chen
KELM
87
66
0
21 Mar 2024
Towards a theory of model distillation
Enric Boix-Adserà
FedML
VLM
69
8
0
14 Mar 2024