SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

22 May 2025

Papers citing "SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models"

17 / 17 papers shown

Title
FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering Yongbin Li Zhiting Fan Ruizhe Chen Xiaotang Gai Luqi Gong Yan Zhang Zuozhu Liu LLMSV 64 5 0 20 Apr 2025
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders Xuansheng Wu Jiayi Yuan Wenlin Yao Xiaoming Zhai Ninghao Liu LLMSV 134 9 0 24 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing Subhash Kantamneni Joshua Engels Senthooran Rajamanoharan Max Tegmark Neel Nanda 107 13 0 23 Feb 2025
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models Z. He Haiyan Zhao Yiran Qiao Fan Yang Ali Payani Jing Ma Jundong Li LLMSV 97 9 0 17 Feb 2025
Sparse Autoencoder Features for Classifications and Transferability Jack Gallifant Shan Chen Kuleen Sasse Hugo J. W. L. Aerts Thomas Hartvigsen Danielle S. Bitterman 70 5 0 17 Feb 2025
A Unified Understanding and Evaluation of Steering Methods Shawn Im Yixuan Li LLMSV 66 6 0 04 Feb 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Javier Ferrando Oscar Obeso Senthooran Rajamanoharan Neel Nanda 141 27 0 21 Nov 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 86 5 0 30 Sep 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories Tianlong Wang Xianfeng Jiao Yifan He Zhongzhi Chen Yinghao Zhu Xu Chu Junyi Gao Yasha Wang Liantao Ma LLMSV 99 12 0 26 May 2024
Word Embeddings Are Steers for Language Models Chi Han Jialiang Xu Manling Li Yi R. Fung Chenkai Sun Nan Jiang Tarek Abdelzaher Heng Ji LLMSV 64 40 0 22 May 2023
Efficient fair PCA for fair representation learning Matthäus Kleindessner Michele Donini Chris Russell Muhammad Bilal Zafar FaML 46 16 0 26 Feb 2023
Contrastive Decoding: Open-ended Text Generation as Optimization Xiang Lisa Li Ari Holtzman Daniel Fried Percy Liang Jason Eisner Tatsunori Hashimoto Luke Zettlemoyer M. Lewis 89 358 0 27 Oct 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 773 12,893 0 04 Mar 2022
Finetuned Language Models Are Zero-Shot Learners Jason W. Wei Maarten Bosma Vincent Zhao Kelvin Guu Adams Wei Yu Brian Lester Nan Du Andrew M. Dai Quoc V. Le ALM UQCV 120 3,742 0 03 Sep 2021
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 256 443 0 24 Feb 2021
Plug and Play Language Models: A Simple Approach to Controlled Text Generation Sumanth Dathathri Andrea Madotto Janice Lan Jane Hung Eric Frank Piero Molino J. Yosinski Rosanne Liu KELM 123 969 0 04 Dec 2019
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) Been Kim Martin Wattenberg Justin Gilmer Carrie J. Cai James Wexler F. Viégas Rory Sayres FAtt 207 1,837 0 30 Nov 2017