Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection

17 May 2023

Papers citing "Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection"

Title
Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends M. Tami Mohammed Elhenawy Huthaifa I. Ashqar 41 0 0 21 Apr 2025
BiasEdit: Debiasing Stereotyped Language Models via Model Editing Xin Xu Wei Xu N. Zhang Julian McAuley KELM 44 0 0 11 Mar 2025
Focus On This, Not That! Steering LLMs With Adaptive Feature Specification Tom A. Lamb Adam Davies Alasdair Paren Philip Torr Francesco Pinto 49 0 0 30 Oct 2024
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering Yu Zhao Alessio Devoto Giwon Hong Xiaotang Du Aryo Pradipta Gema Hongru Wang Xuanli He Kam-Fai Wong Pasquale Minervini KELM LLMSV 39 16 0 21 Oct 2024
AGR: Age Group fairness Reward for Bias Mitigation in LLMs Shuirong Cao Ruoxi Cheng Zhiqiang Wang 34 4 0 06 Sep 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 52 18 0 02 Aug 2024
Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas Chengyuan Deng Yiqun Duan Xin Jin Heng Chang Yijun Tian ... Kuofeng Gao Sihong He Jun Zhuang Lu Cheng Haohan Wang AILaw 43 16 0 08 Jun 2024
The Life Cycle of Large Language Models: A Review of Biases in Education Jinsook Lee Yann Hicke Renzhe Yu Christopher A. Brooks René F. Kizilcec AI4Ed 42 1 0 03 Jun 2024
Spectral Editing of Activations for Large Language Model Alignment Yifu Qiu Zheng Zhao Yftah Ziser Anna Korhonen E. Ponti Shay B. Cohen KELM LLMSV 28 15 0 15 May 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 46 115 0 28 Mar 2024
Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information Shadi Iskander Kira Radinsky Yonatan Belinkov 47 4 0 14 Mar 2024
On the Scaling Laws of Geographical Representation in Language Models Nathan Godey Eric Villemonte de la Clergerie Benoît Sagot 49 6 0 29 Feb 2024
MultiContrievers: Analysis of Dense Retrieval Representations Seraphina Goldfarb-Tarrant Pedro Rodriguez Jane Dwivedi-Yu Patrick Lewis 28 1 0 24 Feb 2024
Tackling Bias in Pre-trained Language Models: Current Trends and Under-represented Societies Vithya Yogarajan Gillian Dobbie Te Taka Keegan R. Neuwirth ALM 43 11 0 03 Dec 2023
Bias and Fairness in Large Language Models: A Survey Isabel O. Gallegos Ryan A. Rossi Joe Barrow Md Mehrab Tanjim Sungchul Kim Franck Dernoncourt Tong Yu Ruiyi Zhang Nesreen Ahmed AILaw 26 490 0 02 Sep 2023
A Survey on Fairness in Large Language Models Yingji Li Mengnan Du Rui Song Xin Wang Ying Wang ALM 52 59 0 20 Aug 2023
Linear Adversarial Concept Erasure Shauli Ravfogel Michael Twiton Yoav Goldberg Ryan Cotterell KELM 81 57 0 28 Jan 2022
Debiasing Methods in Natural Language Understanding Make Bias More Accessible Michael J. Mendelson Yonatan Belinkov 42 23 0 09 Sep 2021
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 226 405 0 24 Feb 2021
Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation Tianlu Wang Xi Lin Nazneen Rajani Bryan McCann Vicente Ordonez Caiming Xiong CVBM 157 54 0 03 May 2020