SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

22 May 2025

Papers citing "SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models"

FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering | Yongbin Li, Zhiting Fan, Ruizhe Chen, Xiaotang Gai, Luqi Gong, Yan Zhang, Zuozhu Liu | LLMSV | 64 5 0 | 20 Apr 2025
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders | Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu | LLMSV | 134 9 0 | 24 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing | Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda | - | 107 13 0 | 23 Feb 2025
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models | Z. He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Jundong Li | LLMSV | 97 9 0 | 17 Feb 2025
Sparse Autoencoder Features for Classifications and Transferability | Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo J. W. L. Aerts, Thomas Hartvigsen, Danielle S. Bitterman | - | 70 5 0 | 17 Feb 2025
A Unified Understanding and Evaluation of Steering Methods | Shawn Im, Yixuan Li | LLMSV | 64 6 0 | 04 Feb 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models | Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda | - | 141 27 0 | 21 Nov 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution | Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du | - | 86 5 0 | 30 Sep 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories | Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma | LLMSV | 99 12 0 | 26 May 2024
Word Embeddings Are Steers for Language Models | Chi Han, Jialiang Xu, Manling Li, Yi R. Fung, Chenkai Sun, Nan Jiang, Tarek Abdelzaher, Heng Ji | LLMSV | 64 40 0 | 22 May 2023
Efficient fair PCA for fair representation learning | Matthäus Kleindessner, Michele Donini, Chris Russell, Muhammad Bilal Zafar | FaML | 46 15 0 | 26 Feb 2023
Contrastive Decoding: Open-ended Text Generation as Optimization | Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, M. Lewis | - | 89 358 0 | 27 Oct 2022
Finetuned Language Models Are Zero-Shot Learners | Jason W. Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le | ALM UQCV | 118 3,723 0 | 03 Sep 2021
Probing Classifiers: Promises, Shortcomings, and Advances | Yonatan Belinkov | - | 256 440 0 | 24 Feb 2021
Plug and Play Language Models: A Simple Approach to Controlled Text Generation | Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, J. Yosinski, Rosanne Liu | KELM | 121 968 0 | 04 Dec 2019
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) | Been Kim, Martin Wattenberg, Justin Gilmer, Carrie J. Cai, James Wexler, F. Viégas, Rory Sayres | FAtt | 201 1,837 0 | 30 Nov 2017