Kernelized Concept Erasure

28 January 2022

Papers citing "Kernelized Concept Erasure"

31 / 31 papers shown

Title
Fundamental Limits of Perfect Concept Erasure Somnath Basu Roy Chowdhury Avinava Dubey Ahmad Beirami Rahul Kidambi Nicholas Monath Amr Ahmed Snigdha Chaturvedi 61 0 0 25 Mar 2025
Gumbel Counterfactual Generation From Language Models Shauli Ravfogel Anej Svete Vésteinn Snæbjarnarson Ryan Cotterell LRM CML 33 1 0 11 Nov 2024
Focus On This, Not That! Steering LLMs With Adaptive Feature Specification Tom A. Lamb Adam Davies Alasdair Paren Philip H. S. Torr Francesco Pinto 47 0 0 30 Oct 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 52 18 0 02 Aug 2024
Machine Unlearning Fails to Remove Data Poisoning Attacks Martin Pawelczyk Jimmy Z. Di Yiwei Lu Gautam Kamath Ayush Sekhari Seth Neel AAML MU 60 8 0 25 Jun 2024
Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models Dohyun Lee Daniel Rim Minseok Choi Jaegul Choo PILM MU 62 4 0 20 Jun 2024
Exploring Safety-Utility Trade-Offs in Personalized Language Models Anvesh Rao Vijjini Somnath Basu Roy Chowdhury Snigdha Chaturvedi 51 6 0 17 Jun 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 46 115 0 28 Mar 2024
Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information Shadi Iskander Kira Radinsky Yonatan Belinkov 45 4 0 14 Mar 2024
Representation Surgery: Theory and Practice of Affine Steering Shashwat Singh Shauli Ravfogel Jonathan Herzig Roee Aharoni Ryan Cotterell Ponnurangam Kumaraguru LLMSV 27 13 0 15 Feb 2024
Explaining Text Classifiers with Counterfactual Representations Pirmin Lemberger Antoine Saillenfest 39 0 0 01 Feb 2024
The Ethics of Automating Legal Actors Josef Valvoda Alec Thompson Ryan Cotterell Simone Teufel AILaw ELM 24 1 0 01 Dec 2023
Robust Concept Erasure via Kernelized Rate-Distortion Maximization Somnath Basu Roy Chowdhury Nicholas Monath Kumar Avinava Dubey Amr Ahmed Snigdha Chaturvedi 32 4 0 30 Nov 2023
Gen-Z: Generative Zero-Shot Text Classification with Contextualized Label Descriptions Sachin Kumar Chan Young Park Yulia Tsvetkov VLM 30 2 0 13 Nov 2023
Counterfactually Probing Language Identity in Multilingual Models Anirudh Srinivasan Venkata S Govindarajan Kyle Mahowald 23 1 0 29 Oct 2023
Fair Streaming Principal Component Analysis: Statistical and Algorithmic Viewpoint Junghyun Lee Hanseul Cho Se-Young Yun Chulhee Yun 35 5 0 28 Oct 2023
How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation Marco Gaido Dennis Fucci Matteo Negri L. Bentivogli 37 2 0 23 Oct 2023
Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation Floris Holstege Bram Wouters Noud van Giersbergen C. Diks 34 1 0 18 Oct 2023
In-Context Unlearning: Language Models as Few Shot Unlearners Martin Pawelczyk Seth Neel Himabindu Lakkaraju MU 28 100 0 11 Oct 2023
LEACE: Perfect linear concept erasure in closed form Nora Belrose David Schneider-Joseph Shauli Ravfogel Ryan Cotterell Edward Raff Stella Biderman KELM MU 41 102 0 06 Jun 2023
Counterfactual Probing for the Influence of Affect and Specificity on Intergroup Bias Venkata S Govindarajan Kyle Mahowald David Beaver J. Li 15 2 0 25 May 2023
Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection Shadi Iskander Kira Radinsky Yonatan Belinkov 30 17 0 17 May 2023
Emergent and Predictable Memorization in Large Language Models Stella Biderman USVSN Sai Prashanth Lintang Sutawika Hailey Schoelkopf Quentin G. Anthony Shivanshu Purohit Edward Raff 29 116 0 21 Apr 2023
Competence-Based Analysis of Language Models Adam Davies Jize Jiang Chengxiang Zhai ELM 29 4 0 01 Mar 2023
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models Peter Henderson E. Mitchell Christopher D. Manning Dan Jurafsky Chelsea Finn 23 47 0 27 Nov 2022
Probing Classifiers are Unreliable for Concept Removal and Detection Abhinav Kumar Chenhao Tan Amit Sharma AAML 31 20 0 08 Jul 2022
Naturalistic Causal Probing for Morpho-Syntax Afra Amini Tiago Pimentel Clara Meister Ryan Cotterell MILM 106 18 0 14 May 2022
Probing for the Usage of Grammatical Number Karim Lasri Tiago Pimentel Alessandro Lenci Thierry Poibeau Ryan Cotterell 35 55 0 19 Apr 2022
Linear Adversarial Concept Erasure Shauli Ravfogel Michael Twiton Yoav Goldberg Ryan Cotterell KELM 81 57 0 28 Jan 2022
On the Global Optima of Kernelized Adversarial Representation Learning Bashir Sadeghi Runyi Yu Vishnu Naresh Boddeti AAML 67 31 0 16 Oct 2019
Efficient Estimation of Word Representations in Vector Space Tomáš Mikolov Kai Chen G. Corrado J. Dean 3DV 275 31,267 0 16 Jan 2013