Probing Classifiers are Unreliable for Concept Removal and Detection

Probing Classifiers are Unreliable for Concept Removal and Detection

8 July 2022

Papers citing "Probing Classifiers are Unreliable for Concept Removal and Detection"

12 / 12 papers shown

Title
Representation Engineering for Large-Language Models: Survey and Research Challenges Lukasz Bartoszcze Sarthak Munshi Bryan Sukidi Jennifer Yen Zejia Yang David Williams-King Linh Le Kosi Asuzu Carsten Maple 102 0 0 24 Feb 2025
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs Nitay Calderon Roi Reichart 40 10 0 27 Jul 2024
A Geometric Notion of Causal Probing Clément Guerner Anej Svete Tianyu Liu Alex Warstadt Ryan Cotterell LLMSV 41 12 0 27 Jul 2023
LEACE: Perfect linear concept erasure in closed form Nora Belrose David Schneider-Joseph Shauli Ravfogel Ryan Cotterell Edward Raff Stella Biderman KELM MU 41 102 0 06 Jun 2023
Target-Agnostic Gender-Aware Contrastive Learning for Mitigating Bias in Multilingual Machine Translation Minwoo Lee Hyukhun Koh Kang-il Lee Dongdong Zhang Minsu Kim Kyomin Jung 35 9 0 23 May 2023
Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection Shadi Iskander Kira Radinsky Yonatan Belinkov 38 17 0 17 May 2023
Towards Procedural Fairness: Uncovering Biases in How a Toxic Language Classifier Uses Sentiment Information I. Nejadgholi Esma Balkir Kathleen C. Fraser S. Kiritchenko 40 3 0 19 Oct 2022
Linear Adversarial Concept Erasure Shauli Ravfogel Michael Twiton Yoav Goldberg Ryan Cotterell KELM 81 57 0 28 Jan 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 226 405 0 24 Feb 2021
An Investigation of Why Overparameterization Exacerbates Spurious Correlations Shiori Sagawa Aditi Raghunathan Pang Wei Koh Percy Liang 152 371 0 09 May 2020
A Survey on Bias and Fairness in Machine Learning Ninareh Mehrabi Fred Morstatter N. Saxena Kristina Lerman Aram Galstyan SyDa FaML 326 4,223 0 23 Aug 2019
What you can cram into a single vector: Probing sentence embeddings for linguistic properties Alexis Conneau Germán Kruszewski Guillaume Lample Loïc Barrault Marco Baroni 201 882 0 03 May 2018