2201.12091
Linear Adversarial Concept Erasure
28 January 2022
Shauli Ravfogel
Michael Twiton
Yoav Goldberg
Ryan Cotterell
KELM
Papers citing "Linear Adversarial Concept Erasure"
20 / 20 papers shown
Title
Controllable Context Sensitivity and the Knob Behind It
Julian Minder
Kevin Du
Niklas Stoehr
Giovanni Monea
Chris Wendler
Robert West
Ryan Cotterell
KELM
44
3
0
11 Nov 2024
Collapsed Language Models Promote Fairness
Jingxuan Xu
Wuyang Chen
Linyi Li
Yao Zhao
Yunchao Wei
39
0
0
06 Oct 2024
Machine Unlearning Fails to Remove Data Poisoning Attacks
Martin Pawelczyk
Jimmy Z. Di
Yiwei Lu
Gautam Kamath
Ayush Sekhari
Seth Neel
AAML
MU
54
8
0
25 Jun 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
44
111
0
28 Mar 2024
Robust Concept Erasure via Kernelized Rate-Distortion Maximization
Somnath Basu Roy Chowdhury
Nicholas Monath
Kumar Avinava Dubey
Amr Ahmed
Snigdha Chaturvedi
19
4
0
30 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori
Thomas Serre
Ellie Pavlick
70
7
0
07 Nov 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
23
332
0
15 Sep 2023
A Survey on Fairness in Large Language Models
Yingji Li
Mengnan Du
Rui Song
Xin Wang
Ying Wang
ALM
41
59
0
20 Aug 2023
A Geometric Notion of Causal Probing
Clément Guerner
Anej Svete
Tianyu Liu
Alex Warstadt
Ryan Cotterell
LLMSV
36
12
0
27 Jul 2023
Prompt Tuning Pushes Farther, Contrastive Learning Pulls Closer: A Two-Stage Approach to Mitigate Social Biases
Yingji Li
Mengnan Du
Xin Wang
Ying Wang
45
26
0
04 Jul 2023
LEACE: Perfect linear concept erasure in closed form
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Ryan Cotterell
Edward Raff
Stella Biderman
KELM
MU
41
102
0
06 Jun 2023
DAMP: Doubly Aligned Multilingual Parser for Task-Oriented Dialogue
William B. Held
Christopher Hidey
Fei Liu
Eric Zhu
Rahul Goel
Diyi Yang
Rushin Shah
21
0
0
15 Dec 2022
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
Peter Henderson
E. Mitchell
Christopher D. Manning
Dan Jurafsky
Chelsea Finn
16
47
0
27 Nov 2022
Log-linear Guardedness and its Implications
Shauli Ravfogel
Yoav Goldberg
Ryan Cotterell
28
2
0
18 Oct 2022
Visual Comparison of Language Model Adaptation
R. Sevastjanova
E. Cakmak
Shauli Ravfogel
Ryan Cotterell
Mennatallah El-Assady
VLM
33
16
0
17 Aug 2022
Analyzing Gender Representation in Multilingual Models
Hila Gonen
Shauli Ravfogel
Yoav Goldberg
15
11
0
20 Apr 2022
Debiasing Pre-trained Contextualised Embeddings
Masahiro Kaneko
Danushka Bollegala
210
138
0
23 Jan 2021
On the Global Optima of Kernelized Adversarial Representation Learning
Bashir Sadeghi
Runyi Yu
Vishnu Naresh Boddeti
AAML
59
31
0
16 Oct 2019
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras
S. Laine
Timo Aila
264
10,348
0
12 Dec 2018
Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification
Xilun Chen
Yu Sun
Ben Athiwaratkun
Claire Cardie
Kilian Q. Weinberger
217
315
0
06 Jun 2016