Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2505.16188
Cited By
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
22 May 2025
Zirui He
Mingyu Jin
Bo Shen
Ali Payani
Yongfeng Zhang
Mengnan Du
LLMSV
Re-assign community
ArXiv
PDF
HTML
Papers citing
"SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models"
16 / 16 papers shown
Title
FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering
Yongbin Li
Zhiting Fan
Ruizhe Chen
Xiaotang Gai
Luqi Gong
Yan Zhang
Zuozhu Liu
LLMSV
64
5
0
20 Apr 2025
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
Xuansheng Wu
Jiayi Yuan
Wenlin Yao
Xiaoming Zhai
Ninghao Liu
LLMSV
134
9
0
24 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni
Joshua Engels
Senthooran Rajamanoharan
Max Tegmark
Neel Nanda
107
13
0
23 Feb 2025
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
Z. He
Haiyan Zhao
Yiran Qiao
Fan Yang
Ali Payani
Jing Ma
Jundong Li
LLMSV
97
9
0
17 Feb 2025
Sparse Autoencoder Features for Classifications and Transferability
Jack Gallifant
Shan Chen
Kuleen Sasse
Hugo J. W. L. Aerts
Thomas Hartvigsen
Danielle S. Bitterman
70
5
0
17 Feb 2025
A Unified Understanding and Evaluation of Steering Methods
Shawn Im
Yixuan Li
LLMSV
64
6
0
04 Feb 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
141
27
0
21 Nov 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
86
5
0
30 Sep 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
99
12
0
26 May 2024
Word Embeddings Are Steers for Language Models
Chi Han
Jialiang Xu
Manling Li
Yi R. Fung
Chenkai Sun
Nan Jiang
Tarek Abdelzaher
Heng Ji
LLMSV
64
40
0
22 May 2023
Efficient fair PCA for fair representation learning
Matthäus Kleindessner
Michele Donini
Chris Russell
Muhammad Bilal Zafar
FaML
46
15
0
26 Feb 2023
Contrastive Decoding: Open-ended Text Generation as Optimization
Xiang Lisa Li
Ari Holtzman
Daniel Fried
Percy Liang
Jason Eisner
Tatsunori Hashimoto
Luke Zettlemoyer
M. Lewis
89
358
0
27 Oct 2022
Finetuned Language Models Are Zero-Shot Learners
Jason W. Wei
Maarten Bosma
Vincent Zhao
Kelvin Guu
Adams Wei Yu
Brian Lester
Nan Du
Andrew M. Dai
Quoc V. Le
ALM
UQCV
118
3,723
0
03 Sep 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
256
440
0
24 Feb 2021
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
Sumanth Dathathri
Andrea Madotto
Janice Lan
Jane Hung
Eric Frank
Piero Molino
J. Yosinski
Rosanne Liu
KELM
121
968
0
04 Dec 2019
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
Been Kim
Martin Wattenberg
Justin Gilmer
Carrie J. Cai
James Wexler
F. Viégas
Rory Sayres
FAtt
201
1,837
0
30 Nov 2017
1