2501.17727
Cited By
Sparse Autoencoders Can Interpret Randomly Initialized Transformers
29 January 2025
Thomas Heap
Tim Lawson
Lucy Farnik
Laurence Aitchison
Papers citing "Sparse Autoencoders Can Interpret Randomly Initialized Transformers"
12 / 12 papers shown
Title
Train Sparse Autoencoders Efficiently by Utilizing Features Correlation
Vadim Kurochkin
Yaroslav Aksenov
Daniil Laptev
Daniil Gavrilov
Nikita Balagansky
36
0
0
28 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Patrick Leask
Neel Nanda
Noura Al Moubayed
66
1
0
23 May 2025
Explaining Neural Networks with Reasons
Levin Hornischer
Hannes Leitgeb
FAtt
AAML
MILM
88
0
0
20 May 2025
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
Woody Haosheng Gan
Deqing Fu
Julian Asilis
Ollie Liu
Dani Yogatama
Vatsal Sharan
Robin Jia
Willie Neiswanger
LLMSV
69
0
0
20 May 2025
SplInterp: Improving our Understanding and Training of Sparse Autoencoders
Jeremy Budd
Javier Ideami
Benjamin Macdowall Rynne
Keith Duggar
Randall Balestriero
68
0
0
17 May 2025
Probing the Vulnerability of Large Language Models to Polysemantic Interventions
Bofan Gong
Shiyang Lai
Dawn Song
AAML
MILM
49
1
0
16 May 2025
Are Sparse Autoencoders Useful for Java Function Bug Detection?
Rui Melo
Claudia Mamede
Andre Catarino
Rui Abreu
Henrique Lopes Cardoso
81
0
0
15 May 2025
Disentangling Polysemantic Channels in Convolutional Neural Networks
Robin Hesse
Jonas Fischer
Simone Schaub-Meyer
Stefan Roth
FAtt
MILM
94
0
0
17 Apr 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
...
Matthew Wearden
Arthur Conmy
Samuel Marks
Neel Nanda
MU
137
21
0
12 Mar 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik
Tim Lawson
Conor Houghton
Laurence Aitchison
82
1
0
25 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features
Bruno Puri
Aakriti Jain
Elena Golimblevskaia
Patrick Kahardipraja
Thomas Wiegand
Wojciech Samek
Sebastian Lapuschkin
199
0
0
24 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni
Joshua Engels
Senthooran Rajamanoharan
Max Tegmark
Neel Nanda
107
13
0
23 Feb 2025