2408.05147
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
9 August 2024
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
Papers citing
"Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2"
50 / 67 papers shown
Title
Are Sparse Autoencoders Useful for Java Function Bug Detection?
Rui Melo
Claudia Mamede
Andre Catarino
Rui Abreu
Henrique Lopes Cardoso
31
0
0
15 May 2025
An Introduction to Discrete Variational Autoencoders
Alan Jeffares
Liyuan Liu
DRL
BDL
CML
41
0
0
15 May 2025
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
Dong Shu
Xuansheng Wu
Haiyan Zhao
Jundong Li
Ninghao Liu
LLMSV
42
0
0
12 May 2025
Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao
Ayush Panda
Stepan Shabalin
Sheikh Abdur Raheem Ali
LLMSV
67
0
0
06 May 2025
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
Hans Peter
Anders Søgaard
38
0
0
30 Apr 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He
Jiadong Wang
Rui Lin
Xuyang Ge
Wentao Shu
Qiong Tang
J.N. Zhang
Xipeng Qiu
70
0
0
29 Apr 2025
Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption
Wenxiao Wang
Parsa Hosseini
S. Feizi
LRM
AI4CE
64
0
0
29 Apr 2025
Representation Learning on a Random Lattice
Aryeh Brill
OOD
FAtt
AI4CE
73
0
0
28 Apr 2025
Scaling sparse feature circuit finding for in-context learning
Dmitrii Kharlapenko
Shivalika Singh
Fazl Barez
Arthur Conmy
Neel Nanda
26
0
0
18 Apr 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
51
1
0
17 Apr 2025
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
Aashiq Muhamed
Jacopo Bonato
Mona Diab
Virginia Smith
MU
66
0
0
11 Apr 2025
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Yixin Cao
Jiahao Ying
Yansen Wang
Xipeng Qiu
Xuanjing Huang
Yugang Jiang
ELM
44
2
0
10 Apr 2025
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Simon Lermen
Mateusz Dziemian
Natalia Pérez-Campanero Antolín
38
0
0
10 Apr 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
Sewoong Lee
Adam Davies
Marc E. Canby
J. Hockenmaier
LLMSV
70
0
0
31 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
34
0
0
21 Mar 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
Bart Bussmann
Noa Nabeshima
Adam Karvonen
Neel Nanda
59
1
0
21 Mar 2025
Towards LLM Guardrails via Sparse Representation Steering
Zeqing He
Zhibo Wang
Huiyu Xu
Kui Ren
LLMSV
52
1
0
21 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
...
Kola Ayonrinde
Matthew Wearden
Arthur Conmy
Samuel Marks
Neel Nanda
MU
64
8
0
12 Mar 2025
Mixture of Experts Made Intrinsically Interpretable
Xingyi Yang
Constantin Venhoff
Ashkan Khakzar
Christian Schroeder de Witt
P. Dokania
Adel Bibi
Philip Torr
MoE
54
0
0
05 Mar 2025
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kristian Kuznetsov
Laida Kushnareva
Polina Druzhinina
Anton Razzhigaev
Anastasia Voznyuk
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
42
0
0
05 Mar 2025
Towards Understanding Distilled Reasoning Models: A Representational Approach
David D. Baek
Max Tegmark
LRM
80
3
0
05 Mar 2025
SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs
Samir Abdaljalil
Filippo Pallucchini
Andrea Seveso
Hasan Kurban
Fabio Mercorio
Erchin Serpedin
HILM
77
0
0
04 Mar 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
Sai Sumedh R. Hindupur
Ekdeep Singh Lubana
Thomas Fel
Demba Ba
47
5
0
03 Mar 2025
Interpreting CLIP with Hierarchical Sparse Autoencoders
Vladimir Zaigrajew
Hubert Baniecki
P. Biecek
56
0
0
27 Feb 2025
Do Sparse Autoencoders Generalize? A Case Study of Answerability
Lovis Heindrich
Philip Torr
Fazl Barez
Veronika Thost
82
1
0
27 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features
Bruno Puri
Aakriti Jain
Elena Golimblevskaia
Patrick Kahardipraja
Thomas Wiegand
Wojciech Samek
Sebastian Lapuschkin
135
0
0
24 Feb 2025
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
Xuansheng Wu
Jiayi Yuan
Wenlin Yao
Xiaoming Zhai
Ninghao Liu
LLMSV
80
4
0
24 Feb 2025
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
Yuheng Chen
Pengfei Cao
Kang Liu
Jun Zhao
50
0
0
18 Feb 2025
Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards
Xinyi Yang
Liang Zeng
Heng Dong
Chao Yu
X. Wu
H. Yang
Yu Wang
Milind Tambe
Tonghan Wang
76
2
0
18 Feb 2025
Multi-Faceted Multimodal Monosemanticity
Hanqi Yan
Xiangxiang Cui
Lu Yin
Paul Pu Liang
Yulan He
Yifei Wang
44
0
0
16 Feb 2025
SEER: Self-Explainability Enhancement of Large Language Models' Representations
Guanxu Chen
Dongrui Liu
Tao Luo
Jing Shao
LRM
MILM
67
1
0
07 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
49
1
0
09 Jan 2025
Can Input Attributions Interpret the Inductive Reasoning Process Elicited in In-Context Learning?
Mengyu Ye
Tatsuki Kuribayashi
Goro Kobayashi
Jun Suzuki
LRM
97
0
0
20 Dec 2024
VISTA: A Panoramic View of Neural Representations
Tom White
72
0
0
03 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning
Keito Kudo
Yoichi Aoki
Tatsuki Kuribayashi
Shusaku Sone
Masaya Taniguchi
Ana Brassard
Keisuke Sakaguchi
Kentaro Inui
ReLM
LRM
77
0
0
02 Dec 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
85
12
0
21 Nov 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Charles O'Neill
David Klindt
98
1
0
20 Nov 2024
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
Ruben Härle
Felix Friedrich
Manuel Brack
Bjorn Deiseroth
P. Schramowski
Kristian Kersting
45
0
0
11 Nov 2024
Beyond Toxic Neurons: A Mechanistic Analysis of DPO for Toxicity Reduction
Yushi Yang
Filip Sondej
Harry Mayne
Adam Mahdi
29
1
0
10 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
42
5
0
07 Nov 2024
Improving Steering Vectors by Targeting Sparse Autoencoder Features
Sviatoslav Chalnev
Matthew Siu
Arthur Conmy
LLMSV
55
16
0
04 Nov 2024
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Yuqi Luo
Chenyang Song
Xu Han
Yuxiao Chen
Chaojun Xiao
Zhiyuan Liu
Maosong Sun
54
3
0
04 Nov 2024
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
Aashiq Muhamed
Mona Diab
Virginia Smith
45
2
0
01 Nov 2024
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi
Federico Belotti
Marco Molinari
40
2
0
28 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Viacheslav Surkov
Chris Wendler
Mikhail Terekhov
Justin Deschenaux
Robert West
Çağlar Gülçehre
VLM
40
13
0
28 Oct 2024
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Qi Zhang
Yifei Wang
Jingyi Cui
Xiang Pan
Qi Lei
Stefanie Jegelka
Yisen Wang
AAML
34
1
0
27 Oct 2024
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Zhengfu He
Wentao Shu
Xuyang Ge
Lingjie Chen
Junxuan Wang
...
Qipeng Guo
Xuanjing Huang
Zuxuan Wu
Yu-Gang Jiang
Xipeng Qiu
40
14
0
27 Oct 2024
Applying sparse autoencoders to unlearn knowledge in language models
Eoin Farrell
Yeu-Tong Lau
Arthur Conmy
MU
35
14
0
25 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
65
10
0
18 Oct 2024
The Persian Rug: solving toy models of superposition using large-scale symmetries
Aditya Cowsik
Kfir Dolev
Alex Infanger
26
0
0
15 Oct 2024