Linear concept vectors have proven effective for steering large language models (LLMs). Existing approaches such as linear probing and difference-in-means derive these vectors from LLM hidden representations, but diverse data introduces noise (i.e., irrelevant features) that undermines steering robustness. To address this, we propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which uses Sparse Autoencoders to filter noisy features out of hidden representations. When applied to linear probing and difference-in-means, our method improves their steering success rates. We validate our noise hypothesis through counterfactual experiments and feature visualizations.
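As a rough illustration of the idea, the sketch below denoises hidden representations with a sparse autoencoder before computing a difference-in-means concept vector. The abstract does not specify how noisy features are identified, so the selection rule here (keep the `k` SAE features whose mean activation differs most between the two classes), the ReLU encoder form, and all function and parameter names (`sae_encode`, `select_features`, `W_enc`, `W_dec`, `k`, `alpha`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def diff_in_means(pos, neg):
    # Concept vector: mean hidden state of concept examples minus mean of the rest.
    return pos.mean(axis=0) - neg.mean(axis=0)

def sae_encode(h, W_enc, b_enc):
    # Assumed ReLU encoder of a pre-trained sparse autoencoder (hypothetical weights).
    return np.maximum(h @ W_enc + b_enc, 0.0)

def select_features(pos, neg, W_enc, b_enc, k):
    # Assumed selection rule: keep the k SAE features with the largest
    # absolute gap in mean activation between the two classes.
    gap = np.abs(sae_encode(pos, W_enc, b_enc).mean(axis=0)
                 - sae_encode(neg, W_enc, b_enc).mean(axis=0))
    return np.argsort(gap)[-k:]

def sae_denoise(h, W_enc, b_enc, W_dec, b_dec, keep):
    # Zero every SAE feature outside `keep`, then decode back to hidden space.
    z = sae_encode(h, W_enc, b_enc)
    mask = np.zeros(z.shape[1])
    mask[keep] = 1.0
    return (z * mask) @ W_dec + b_dec

def sdcv_diff_in_means(pos, neg, W_enc, b_enc, W_dec, b_dec, k):
    # Difference-in-means computed on SAE-denoised representations.
    keep = select_features(pos, neg, W_enc, b_enc, k)
    pos_d = sae_denoise(pos, W_enc, b_enc, W_dec, b_dec, keep)
    neg_d = sae_denoise(neg, W_enc, b_enc, W_dec, b_dec, keep)
    return diff_in_means(pos_d, neg_d)

def steer(h, v, alpha):
    # Shift a hidden state along the unit-normalized concept vector.
    return h + alpha * v / np.linalg.norm(v)
```

In a toy setting where the SAE is an identity map and one dimension carries class-irrelevant variation, plain difference-in-means retains a spurious component along that dimension, while the denoised variant with `k=1` keeps only the discriminative direction.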
@article{zhao2025_2505.15038,
  title={Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering},
  author={Haiyan Zhao and Xuansheng Wu and Fan Yang and Bo Shen and Ninghao Liu and Mengnan Du},
  journal={arXiv preprint arXiv:2505.15038},
  year={2025}
}