ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.15576
  4. Cited By
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

24 February 2025
Xuansheng Wu
Jiayi Yuan
Wenlin Yao
Xiaoming Zhai
Ninghao Liu
    LLMSV
ArXivPDFHTML

Papers citing "Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders"

27 / 27 papers shown
Title
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
Zirui He
Mingyu Jin
Bo Shen
Ali Payani
Yongfeng Zhang
Mengnan Du
LLMSV
63
0
0
22 May 2025
Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Zihao Li
Xu Wang
Yuzhe Yang
Ziyu Yao
Haoyi Xiong
Jundong Li
LLMSV
LRM
95
1
0
21 May 2025
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Dong Shu
Xuansheng Wu
Haiyan Zhao
Jundong Li
Ninghao Liu
LLMSV
81
0
0
12 May 2025
Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics
Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics
Yiran He
Yun Cao
Bowen Yang
Zeyu Zhang
71
1
0
16 Apr 2025
Towards Trustworthy GUI Agents: A Survey
Towards Trustworthy GUI Agents: A Survey
Yucheng Shi
Wenhao Yu
Wenlin Yao
Wenhu Chen
Ninghao Liu
74
5
0
30 Mar 2025
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual
  Knowledge in GPT-2 Small
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary
Atticus Geiger
56
16
0
05 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
72
118
0
09 Aug 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse
  Autoencoders
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Senthooran Rajamanoharan
Tom Lieberum
Nicolas Sonnerat
Arthur Conmy
Vikrant Varma
János Kramár
Neel Nanda
66
100
0
19 Jul 2024
HelpSteer2: Open-source dataset for training top-performing reward
  models
HelpSteer2: Open-source dataset for training top-performing reward models
Zhilin Wang
Yi Dong
Olivier Delalleau
Jiaqi Zeng
Gerald Shen
Daniel Egert
Jimmy J. Zhang
Makesh Narsimhan Sreedhar
Oleksii Kuchaiev
AI4TS
94
101
0
12 Jun 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
105
146
0
28 Mar 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large
  Language Models
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li
Bowen Dong
Ruohui Wang
Xuhao Hu
Wangmeng Zuo
Dahua Lin
Yu Qiao
Jing Shao
ELM
63
97
0
07 Feb 2024
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey
Eric Wong
Hamed Hassani
George J. Pappas
AAML
103
247
0
05 Oct 2023
From Language Modeling to Instruction Following: Understanding the
  Behavior Shift in LLMs after Instruction Tuning
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
Xuansheng Wu
Wenlin Yao
Jianshu Chen
Xiaoman Pan
Xiaoyang Wang
Ninghao Liu
Dong Yu
LRM
47
31
0
30 Sep 2023
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao
Yu Cao
Lu Lin
Jinghui Chen
AAML
52
147
0
18 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language
  Models
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
90
412
0
15 Sep 2023
Jailbroken: How Does LLM Safety Training Fail?
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
191
949
0
05 Jul 2023
WebGLM: Towards An Efficient Web-Enhanced Question Answering System with
  Human Preferences
WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences
Xiao Liu
Hanyu Lai
Hao Yu
Yifan Xu
Aohan Zeng
Zhengxiao Du
Peng Zhang
Yuxiao Dong
Jie Tang
38
100
0
13 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
312
4,253
0
09 Jun 2023
Neuron to Graph: Interpreting Language Model Neurons at Scale
Neuron to Graph: Interpreting Language Model Neurons at Scale
Alex Foote
Neel Nanda
Esben Kran
Ioannis Konstas
Shay B. Cohen
Fazl Barez
MILM
53
26
0
31 May 2023
Enhancing Chat Language Models by Scaling High-quality Instructional
  Conversations
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Ning Ding
Yulin Chen
Bokai Xu
Yujia Qin
Zhi Zheng
Shengding Hu
Zhiyuan Liu
Maosong Sun
Bowen Zhou
ALM
118
532
0
23 May 2023
A Primer in BERTology: What we know about how BERT works
A Primer in BERTology: What we know about how BERT works
Anna Rogers
Olga Kovaleva
Anna Rumshisky
OffRL
83
1,496
0
27 Feb 2020
Evaluating Layers of Representation in Neural Machine Translation on
  Part-of-Speech and Semantic Tagging Tasks
Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks
Yonatan Belinkov
Lluís Màrquez i Villodre
Hassan Sajjad
Nadir Durrani
Fahim Dalvi
James R. Glass
53
164
0
23 Jan 2018
Mixed Precision Training
Mixed Precision Training
Paulius Micikevicius
Sharan Narang
Jonah Alben
G. Diamos
Erich Elsen
...
Boris Ginsburg
Michael Houston
Oleksii Kuchaiev
Ganesh Venkatesh
Hao Wu
149
1,792
0
10 Oct 2017
Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Sanjeev Arora
Yuanzhi Li
Yingyu Liang
Tengyu Ma
Andrej Risteski
75
282
0
14 Jan 2016
Delving Deep into Rectifiers: Surpassing Human-Level Performance on
  ImageNet Classification
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
VLM
286
18,587
0
06 Feb 2015
Adam: A Method for Stochastic Optimization
Adam: A Method for Stochastic Optimization
Diederik P. Kingma
Jimmy Ba
ODL
1.5K
149,842
0
22 Dec 2014
k-Sparse Autoencoders
k-Sparse Autoencoders
Alireza Makhzani
Brendan J. Frey
89
451
0
19 Dec 2013
1