Gold Doesn't Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information

15 March 2022

Papers citing "Gold Doesn't Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information"

Title
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi
LLMSV | 64 18 0 | 15 Oct 2024

Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori, Thomas Serre, Ellie Pavlick
78 7 0 | 07 Nov 2023

A Joint Matrix Factorization Analysis of Multilingual Representations
Zheng Zhao, Yftah Ziser, Bonnie Webber, Shay B. Cohen
32 2 0 | 24 Oct 2023

LEACE: Perfect linear concept erasure in closed form
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
KELM MU | 41 102 0 | 06 Jun 2023

Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection
Shadi Iskander, Kira Radinsky, Yonatan Belinkov
44 17 0 | 17 May 2023

Better Hit the Nail on the Head than Beat around the Bush: Removing Protected Attributes with a Single Projection
P. Haghighatkhah, Antske Fokkens, Pia Sommerauer, Bettina Speckmann, Kevin Verbeek
32 10 0 | 08 Dec 2022

Log-linear Guardedness and its Implications
Shauli Ravfogel, Yoav Goldberg, Ryan Cotterell
28 2 0 | 18 Oct 2022

Linear Adversarial Concept Erasure
Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell
KELM | 84 57 0 | 28 Jan 2022