ResearchTrend.AI

LEACE: Perfect linear concept erasure in closed form
6 June 2023
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
arXiv: 2306.03819

Papers citing "LEACE: Perfect linear concept erasure in closed form"
(50 of 119 citing papers shown)
Ethos: Rectifying Language Models in Orthogonal Parameter Space
  Lei Gao, Yue Niu, Tingting Tang, A. Avestimehr, Murali Annavaram (13 Mar 2024)

Guardrail Baselines for Unlearning in LLMs
  Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith (05 Mar 2024)

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
  Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, ..., Yan Shoshitaishvili, Jimmy Ba, K. Esvelt, Alexandr Wang, Dan Hendrycks (05 Mar 2024)

AtP*: An efficient and scalable method for localizing LLM behaviour to components
  János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda (01 Mar 2024)

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
  Jing-ling Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger (27 Feb 2024)
Immunization against harmful fine-tuning attacks
  Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz (26 Feb 2024)

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
  Aryaman Arora, Daniel Jurafsky, Christopher Potts (19 Feb 2024)

Representation Surgery: Theory and Practice of Affine Steering
  Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru (15 Feb 2024)

Suppressing Pink Elephants with Direct Principle Feedback
  Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman (12 Feb 2024)

Explaining Text Classifiers with Counterfactual Representations
  Pirmin Lemberger, Antoine Saillenfest (01 Feb 2024)

A Comprehensive Study of Knowledge Editing for Large Language Models
  Ningyu Zhang, Yunzhi Yao, Bo Tian, Peng Wang, Shumin Deng, ..., Lei Liang, Qing Cui, Xiao-Jun Zhu, Jun Zhou, Huajun Chen (02 Jan 2024)

Improving Activation Steering in Language Models with Mean-Centring
  Ole Jorgensen, Dylan R. Cope, Nandi Schoots, Murray Shanahan (06 Dec 2023)
The Ethics of Automating Legal Actors
  Josef Valvoda, Alec Thompson, Ryan Cotterell, Simone Teufel (01 Dec 2023)

Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion
  Kerem Zaman, Leshem Choshen, Shashank Srivastava (13 Nov 2023)

Uncovering Intermediate Variables in Transformers using Circuit Probing
  Michael A. Lepori, Thomas Serre, Ellie Pavlick (07 Nov 2023)

Debiasing Algorithm through Model Adaptation
  Tomasz Limisiewicz, David Marecek, Tomáš Musil (29 Oct 2023)

Knowledge Editing for Large Language Models: A Survey
  Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, Wenlin Yao (24 Oct 2023)

Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model
  Abhijith Chintam, Rahel Beloch, Willem H. Zuidema, Michael Hanna, Oskar van der Wal (19 Oct 2023)

Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation
  Floris Holstege, Bram Wouters, Noud van Giersbergen, C. Diks (18 Oct 2023)

Emptying the Ocean with a Spoon: Should We Edit Models?
  Yuval Pinter, Michael Elhadad (18 Oct 2023)
The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models
  Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, Shauli Ravfogel (18 Oct 2023)

In-Context Unlearning: Language Models as Few Shot Unlearners
  Martin Pawelczyk, Seth Neel, Himabindu Lakkaraju (11 Oct 2023)

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
  Samuel Marks, Max Tegmark (10 Oct 2023)

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
  Vaidehi Patil, Peter Hase, Joey Tianyi Zhou (29 Sep 2023)

Large Language Model Alignment: A Survey
  Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong (26 Sep 2023)

Sparse Autoencoders Find Highly Interpretable Features in Language Models
  Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey (15 Sep 2023)

Benchmarks for Detecting Measurement Tampering
  Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas (29 Aug 2023)

A Geometric Notion of Causal Probing
  Clément Guerner, Anej Svete, Tianyu Liu, Alex Warstadt, Ryan Cotterell (27 Jul 2023)
Stay on topic with Classifier-Free Guidance
  Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, Stella Biderman (30 Jun 2023)

An Overview of Catastrophic AI Risks
  Dan Hendrycks, Mantas Mazeika, Thomas Woodside (21 Jun 2023)

Editing Large Language Models: Problems, Methods, and Opportunities
  Yunzhi Yao, Peng Wang, Bo Tian, Shuyang Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, Ningyu Zhang (22 May 2023)

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
  Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah D. Goodman (15 May 2023)

Emergent and Predictable Memorization in Large Language Models
  Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin G. Anthony, Shivanshu Purohit, Edward Raff (21 Apr 2023)

Computational modeling of semantic change
  Nina Tahmasebi, Haim Dubossarsky (13 Apr 2023)

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
  Stella Biderman, Hailey Schoelkopf, Quentin G. Anthony, Herbie Bradley, Kyle O'Brien, ..., USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal (03 Apr 2023)
Eliciting Latent Predictions from Transformers with the Tuned Lens
  Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor V. Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt (14 Mar 2023)

Competence-Based Analysis of Language Models
  Adam Davies, Jize Jiang, Chengxiang Zhai (01 Mar 2023)

LLaMA: Open and Efficient Foundation Language Models
  Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample (27 Feb 2023)

Efficient fair PCA for fair representation learning
  Matthäus Kleindessner, Michele Donini, Chris Russell, Muhammad Bilal Zafar (26 Feb 2023)

Erasure of Unaligned Attributes from Neural Representations
  Shun Shao, Yftah Ziser, Shay B. Cohen (06 Feb 2023)

Better Hit the Nail on the Head than Beat around the Bush: Removing Protected Attributes with a Single Projection
  P. Haghighatkhah, Antske Fokkens, Pia Sommerauer, Bettina Speckmann, Kevin Verbeek (08 Dec 2022)

Log-linear Guardedness and its Implications
  Shauli Ravfogel, Yoav Goldberg, Ryan Cotterell (18 Oct 2022)

Causal Conceptions of Fairness and their Consequences
  H. Nilforoshan, Johann D. Gaebler, Ravi Shroff, Sharad Goel (12 Jul 2022)
Probing Classifiers are Unreliable for Concept Removal and Detection
  Abhinav Kumar, Chenhao Tan, Amit Sharma (08 Jul 2022)

Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation
  Verna Dankers, Christopher G. Lucas, Ivan Titov (30 May 2022)

Gold Doesn't Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information
  Shun Shao, Yftah Ziser, Shay B. Cohen (15 Mar 2022)

Kernelized Concept Erasure
  Shauli Ravfogel, Francisco Vargas, Yoav Goldberg, Ryan Cotterell (28 Jan 2022)

Linear Adversarial Concept Erasure
  Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell (28 Jan 2022)

Inducing Causal Structure for Interpretable Neural Networks
  Atticus Geiger, Zhengxuan Wu, Hanson Lu, J. Rozner, Elisa Kreiss, Thomas Icard, Noah D. Goodman, Christopher Potts (01 Dec 2021)

All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality
  William Timkey, Marten van Schijndel (09 Sep 2021)