Investigating Bias Representations in Llama 2 Chat via Activation Steering

1 February 2024

Papers citing "Investigating Bias Representations in Llama 2 Chat via Activation Steering"


Title
Improving Multilingual Language Models by Aligning Representations through Steering
Omar Mahmoud, B. L. Semage, Thommen George Karimpanal, Santu Rana
LLMSV | 58 0 0 | 19 May 2025

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell
MU, AAML, ELM | 117 6 0 | 03 Feb 2025

Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi
LLMSV | 104 26 0 | 15 Oct 2024

Programming Refusal with Conditional Activation Steering
Bruce W. Lee, Inkit Padhi, Karthikeyan N. Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar
LLMSV | 144 23 0 | 06 Sep 2024

The Woman Worked as a Babysitter: On Biases in Language Generation
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng
276 642 0 | 03 Sep 2019