ResearchTrend.AI

LEACE: Perfect linear concept erasure in closed form
6 June 2023
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
arXiv: 2306.03819

Papers citing "LEACE: Perfect linear concept erasure in closed form"
(50 of 119 citing papers shown)
Ethos: Rectifying Language Models in Orthogonal Parameter Space
  Lei Gao, Yue Niu, Tingting Tang, A. Avestimehr, Murali Annavaram (13 Mar 2024)

Guardrail Baselines for Unlearning in LLMs
  Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith (05 Mar 2024)

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
  Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, ..., Yan Shoshitaishvili, Jimmy Ba, K. Esvelt, Alexandr Wang, Dan Hendrycks (05 Mar 2024)

AtP*: An efficient and scalable method for localizing LLM behaviour to components
  János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda (01 Mar 2024)

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
  Jing-ling Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger (27 Feb 2024)
Immunization against harmful fine-tuning attacks
  Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz (26 Feb 2024)

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
  Aryaman Arora, Daniel Jurafsky, Christopher Potts (19 Feb 2024)

Representation Surgery: Theory and Practice of Affine Steering
  Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru (15 Feb 2024)

Suppressing Pink Elephants with Direct Principle Feedback
  Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman (12 Feb 2024)

Explaining Text Classifiers with Counterfactual Representations
  Pirmin Lemberger, Antoine Saillenfest (01 Feb 2024)

A Comprehensive Study of Knowledge Editing for Large Language Models
  Ningyu Zhang, Yunzhi Yao, Bo Tian, Peng Wang, Shumin Deng, ..., Lei Liang, Qing Cui, Xiao-Jun Zhu, Jun Zhou, Huajun Chen (02 Jan 2024)

Improving Activation Steering in Language Models with Mean-Centring
  Ole Jorgensen, Dylan R. Cope, Nandi Schoots, Murray Shanahan (06 Dec 2023)
The Ethics of Automating Legal Actors
  Josef Valvoda, Alec Thompson, Ryan Cotterell, Simone Teufel (01 Dec 2023)

Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion
  Kerem Zaman, Leshem Choshen, Shashank Srivastava (13 Nov 2023)

Uncovering Intermediate Variables in Transformers using Circuit Probing
  Michael A. Lepori, Thomas Serre, Ellie Pavlick (07 Nov 2023)

Debiasing Algorithm through Model Adaptation
  Tomasz Limisiewicz, David Marecek, Tomáš Musil (29 Oct 2023)

Knowledge Editing for Large Language Models: A Survey
  Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, Wenlin Yao (24 Oct 2023)

Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model
  Abhijith Chintam, Rahel Beloch, Willem H. Zuidema, Michael Hanna, Oskar van der Wal (19 Oct 2023)

Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation
  Floris Holstege, Bram Wouters, Noud van Giersbergen, C. Diks (18 Oct 2023)

Emptying the Ocean with a Spoon: Should We Edit Models?
  Yuval Pinter, Michael Elhadad (18 Oct 2023)
The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models
  Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, Shauli Ravfogel (18 Oct 2023)

In-Context Unlearning: Language Models as Few Shot Unlearners
  Martin Pawelczyk, Seth Neel, Himabindu Lakkaraju (11 Oct 2023)

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
  Samuel Marks, Max Tegmark (10 Oct 2023)

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
  Vaidehi Patil, Peter Hase, Joey Tianyi Zhou (29 Sep 2023)

Large Language Model Alignment: A Survey
  Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong (26 Sep 2023)

Sparse Autoencoders Find Highly Interpretable Features in Language Models
  Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey (15 Sep 2023)

Benchmarks for Detecting Measurement Tampering
  Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas (29 Aug 2023)

A Geometric Notion of Causal Probing
  Clément Guerner, Anej Svete, Tianyu Liu, Alex Warstadt, Ryan Cotterell (27 Jul 2023)
Stay on topic with Classifier-Free Guidance
  Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, Stella Biderman (30 Jun 2023)

An Overview of Catastrophic AI Risks
  Dan Hendrycks, Mantas Mazeika, Thomas Woodside (21 Jun 2023)

Editing Large Language Models: Problems, Methods, and Opportunities
  Yunzhi Yao, Peng Wang, Bo Tian, Shuyang Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, Ningyu Zhang (22 May 2023)

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
  Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah D. Goodman (15 May 2023)

Emergent and Predictable Memorization in Large Language Models
  Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin G. Anthony, Shivanshu Purohit, Edward Raff (21 Apr 2023)

Computational modeling of semantic change
  Nina Tahmasebi, Haim Dubossarsky (13 Apr 2023)

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
  Stella Biderman, Hailey Schoelkopf, Quentin G. Anthony, Herbie Bradley, Kyle O'Brien, ..., USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal (03 Apr 2023)
Eliciting Latent Predictions from Transformers with the Tuned Lens
  Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor V. Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt (14 Mar 2023)

Competence-Based Analysis of Language Models
  Adam Davies, Jize Jiang, Chengxiang Zhai (01 Mar 2023)

LLaMA: Open and Efficient Foundation Language Models
  Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample (27 Feb 2023)

Efficient fair PCA for fair representation learning
  Matthäus Kleindessner, Michele Donini, Chris Russell, Muhammad Bilal Zafar (26 Feb 2023)

Erasure of Unaligned Attributes from Neural Representations
  Shun Shao, Yftah Ziser, Shay B. Cohen (06 Feb 2023)

Better Hit the Nail on the Head than Beat around the Bush: Removing Protected Attributes with a Single Projection
  P. Haghighatkhah, Antske Fokkens, Pia Sommerauer, Bettina Speckmann, Kevin Verbeek (08 Dec 2022)

Log-linear Guardedness and its Implications
  Shauli Ravfogel, Yoav Goldberg, Ryan Cotterell (18 Oct 2022)

Causal Conceptions of Fairness and their Consequences
  H. Nilforoshan, Johann D. Gaebler, Ravi Shroff, Sharad Goel (12 Jul 2022)
Probing Classifiers are Unreliable for Concept Removal and Detection
  Abhinav Kumar, Chenhao Tan, Amit Sharma (08 Jul 2022)

Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation
  Verna Dankers, Christopher G. Lucas, Ivan Titov (30 May 2022)

Gold Doesn't Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information
  Shun Shao, Yftah Ziser, Shay B. Cohen (15 Mar 2022)

Kernelized Concept Erasure
  Shauli Ravfogel, Francisco Vargas, Yoav Goldberg, Ryan Cotterell (28 Jan 2022)

Linear Adversarial Concept Erasure
  Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell (28 Jan 2022)

Inducing Causal Structure for Interpretable Neural Networks
  Atticus Geiger, Zhengxuan Wu, Hanson Lu, J. Rozner, Elisa Kreiss, Thomas Icard, Noah D. Goodman, Christopher Potts (01 Dec 2021)

All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality
  William Timkey, Marten van Schijndel (09 Sep 2021)