2505.16188
Cited By
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
22 May 2025
Zirui He
Mingyu Jin
Bo Shen
Ali Payani
Yongfeng Zhang
Mengnan Du
LLMSV
Papers citing "SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models"
17 / 17 papers shown
FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering
Yongbin Li
Zhiting Fan
Ruizhe Chen
Xiaotang Gai
Luqi Gong
Yan Zhang
Zuozhu Liu
LLMSV
64
5
0
20 Apr 2025
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
Xuansheng Wu
Jiayi Yuan
Wenlin Yao
Xiaoming Zhai
Ninghao Liu
LLMSV
134
9
0
24 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni
Joshua Engels
Senthooran Rajamanoharan
Max Tegmark
Neel Nanda
107
13
0
23 Feb 2025
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
Z. He
Haiyan Zhao
Yiran Qiao
Fan Yang
Ali Payani
Jing Ma
Jundong Li
LLMSV
97
9
0
17 Feb 2025
Sparse Autoencoder Features for Classifications and Transferability
Jack Gallifant
Shan Chen
Kuleen Sasse
Hugo J. W. L. Aerts
Thomas Hartvigsen
Danielle S. Bitterman
70
5
0
17 Feb 2025
A Unified Understanding and Evaluation of Steering Methods
Shawn Im
Yixuan Li
LLMSV
66
6
0
04 Feb 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
141
27
0
21 Nov 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
86
5
0
30 Sep 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
99
12
0
26 May 2024
Word Embeddings Are Steers for Language Models
Chi Han
Jialiang Xu
Manling Li
Yi R. Fung
Chenkai Sun
Nan Jiang
Tarek Abdelzaher
Heng Ji
LLMSV
64
40
0
22 May 2023
Efficient fair PCA for fair representation learning
Matthäus Kleindessner
Michele Donini
Chris Russell
Muhammad Bilal Zafar
FaML
46
16
0
26 Feb 2023
Contrastive Decoding: Open-ended Text Generation as Optimization
Xiang Lisa Li
Ari Holtzman
Daniel Fried
Percy Liang
Jason Eisner
Tatsunori Hashimoto
Luke Zettlemoyer
M. Lewis
89
358
0
27 Oct 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
773
12,893
0
04 Mar 2022
Finetuned Language Models Are Zero-Shot Learners
Jason W. Wei
Maarten Bosma
Vincent Zhao
Kelvin Guu
Adams Wei Yu
Brian Lester
Nan Du
Andrew M. Dai
Quoc V. Le
ALM
UQCV
120
3,742
0
03 Sep 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
256
443
0
24 Feb 2021
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
Sumanth Dathathri
Andrea Madotto
Janice Lan
Jane Hung
Eric Frank
Piero Molino
J. Yosinski
Rosanne Liu
KELM
123
969
0
04 Dec 2019
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
Been Kim
Martin Wattenberg
Justin Gilmer
Carrie J. Cai
James Wexler
F. Viégas
Rory Sayres
FAtt
207
1,837
0
30 Nov 2017