Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2502.15576
Cited By
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
24 February 2025
Xuansheng Wu
Jiayi Yuan
Wenlin Yao
Xiaoming Zhai
Ninghao Liu
LLMSV
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders"
27 / 27 papers shown
Title
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
Zirui He
Mingyu Jin
Bo Shen
Ali Payani
Yongfeng Zhang
Mengnan Du
LLMSV
63
0
0
22 May 2025
Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Zihao Li
Xu Wang
Yuzhe Yang
Ziyu Yao
Haoyi Xiong
Jundong Li
LLMSV
LRM
95
1
0
21 May 2025
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Dong Shu
Xuansheng Wu
Haiyan Zhao
Jundong Li
Ninghao Liu
LLMSV
81
0
0
12 May 2025
Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics
Yiran He
Yun Cao
Bowen Yang
Zeyu Zhang
71
1
0
16 Apr 2025
Towards Trustworthy GUI Agents: A Survey
Yucheng Shi
Wenhao Yu
Wenlin Yao
Wenhu Chen
Ninghao Liu
74
5
0
30 Mar 2025
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary
Atticus Geiger
56
16
0
05 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
72
118
0
09 Aug 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Senthooran Rajamanoharan
Tom Lieberum
Nicolas Sonnerat
Arthur Conmy
Vikrant Varma
János Kramár
Neel Nanda
66
100
0
19 Jul 2024
HelpSteer2: Open-source dataset for training top-performing reward models
Zhilin Wang
Yi Dong
Olivier Delalleau
Jiaqi Zeng
Gerald Shen
Daniel Egert
Jimmy J. Zhang
Makesh Narsimhan Sreedhar
Oleksii Kuchaiev
AI4TS
94
101
0
12 Jun 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
105
146
0
28 Mar 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li
Bowen Dong
Ruohui Wang
Xuhao Hu
Wangmeng Zuo
Dahua Lin
Yu Qiao
Jing Shao
ELM
63
97
0
07 Feb 2024
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey
Eric Wong
Hamed Hassani
George J. Pappas
AAML
103
247
0
05 Oct 2023
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
Xuansheng Wu
Wenlin Yao
Jianshu Chen
Xiaoman Pan
Xiaoyang Wang
Ninghao Liu
Dong Yu
LRM
47
31
0
30 Sep 2023
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao
Yu Cao
Lu Lin
Jinghui Chen
AAML
52
147
0
18 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
90
412
0
15 Sep 2023
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
191
949
0
05 Jul 2023
WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences
Xiao Liu
Hanyu Lai
Hao Yu
Yifan Xu
Aohan Zeng
Zhengxiao Du
Peng Zhang
Yuxiao Dong
Jie Tang
38
100
0
13 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
312
4,253
0
09 Jun 2023
Neuron to Graph: Interpreting Language Model Neurons at Scale
Alex Foote
Neel Nanda
Esben Kran
Ioannis Konstas
Shay B. Cohen
Fazl Barez
MILM
53
26
0
31 May 2023
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Ning Ding
Yulin Chen
Bokai Xu
Yujia Qin
Zhi Zheng
Shengding Hu
Zhiyuan Liu
Maosong Sun
Bowen Zhou
ALM
118
532
0
23 May 2023
A Primer in BERTology: What we know about how BERT works
Anna Rogers
Olga Kovaleva
Anna Rumshisky
OffRL
83
1,496
0
27 Feb 2020
Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks
Yonatan Belinkov
Lluís Màrquez i Villodre
Hassan Sajjad
Nadir Durrani
Fahim Dalvi
James R. Glass
53
164
0
23 Jan 2018
Mixed Precision Training
Paulius Micikevicius
Sharan Narang
Jonah Alben
G. Diamos
Erich Elsen
...
Boris Ginsburg
Michael Houston
Oleksii Kuchaiev
Ganesh Venkatesh
Hao Wu
149
1,792
0
10 Oct 2017
Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Sanjeev Arora
Yuanzhi Li
Yingyu Liang
Tengyu Ma
Andrej Risteski
75
282
0
14 Jan 2016
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
VLM
286
18,587
0
06 Feb 2015
Adam: A Method for Stochastic Optimization
Diederik P. Kingma
Jimmy Ba
ODL
1.5K
149,842
0
22 Dec 2014
k-Sparse Autoencoders
Alireza Makhzani
Brendan J. Frey
89
451
0
19 Dec 2013
1