ResearchTrend.AI

arXiv: 2411.11296

Steering Language Model Refusal with Sparse Autoencoders

18 November 2024
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangdeh
    LLMSV

Papers citing "Steering Language Model Refusal with Sparse Autoencoders"

3 / 3 papers shown
Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao
Ayush Panda
Stepan Shabalin
Sheikh Abdur Raheem Ali
LLMSV
62
0
0
06 May 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
65
0
0
08 Mar 2025
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
Madeline Brumley
Joe Kwon
David M. Krueger
Dmitrii Krasheninnikov
Usman Anwar
LLMSV
39
6
0
11 Nov 2024