Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

21 May 2025

Papers citing "Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering"

8 / 8 papers shown

Title
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models Z. He Haiyan Zhao Yiran Qiao Fan Yang Ali Payani Jing Ma Jundong Li LLMSV 100 9 0 17 Feb 2025
Improving Instruction-Following in Language Models through Activation Steering Alessandro Stolfo Vidhisha Balachandran Safoora Yousefi Eric Horvitz Besmira Nushi LLMSV 107 26 0 15 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 89 5 0 30 Sep 2024
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders David Chanin James Wilken-Smith Tomáš Dulka Hardik Bhatnagar Joseph Bloom Joseph Isaac Bloom 86 35 0 22 Sep 2024
Uncovering Latent Chain of Thought Vectors in Language Models Jason Zhang Scott Viteri LLMSV LRM 95 3 0 21 Sep 2024
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering Sheng Liu Haotian Ye Lei Xing James Y. Zou 79 110 0 11 Nov 2023
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) Been Kim Martin Wattenberg Justin Gilmer Carrie J. Cai James Wexler F. Viégas Rory Sayres FAtt 211 1,842 0 30 Nov 2017
Attention Is All You Need Ashish Vaswani Noam M. Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan Gomez Lukasz Kaiser Illia Polosukhin 3DV 687 131,526 0 12 Jun 2017