2411.11296
Cited By
Steering Language Model Refusal with Sparse Autoencoders
18 November 2024
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangdeh
LLMSV
Papers citing "Steering Language Model Refusal with Sparse Autoencoders"
3 / 3 papers shown
Title
Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao
Ayush Panda
Stepan Shabalin
Sheikh Abdur Raheem Ali
LLMSV
62
0
0
06 May 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
65
0
0
08 Mar 2025
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
Madeline Brumley
Joe Kwon
David M. Krueger
Dmitrii Krasheninnikov
Usman Anwar
LLMSV
39
6
0
11 Nov 2024