Sparse Autoencoders Can Interpret Randomly Initialized Transformers

29 January 2025

Laurence Aitchison

Papers citing "Sparse Autoencoders Can Interpret Randomly Initialized Transformers"

Title | Authors | Tags | Counts | Date
Train Sparse Autoencoders Efficiently by Utilizing Features Correlation | Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky | | 36 0 0 | 28 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models | Patrick Leask, Neel Nanda, Noura Al Moubayed | | 66 1 0 | 23 May 2025
Explaining Neural Networks with Reasons | Levin Hornischer, Hannes Leitgeb | FAtt AAML MILM | 88 0 0 | 20 May 2025
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models | Woody Haosheng Gan, Deqing Fu, Julian Asilis, Ollie Liu, Dani Yogatama, Vatsal Sharan, Robin Jia, Willie Neiswanger | LLMSV | 69 0 0 | 20 May 2025
SplInterp: Improving our Understanding and Training of Sparse Autoencoders | Jeremy Budd, Javier Ideami, Benjamin Macdowall Rynne, Keith Duggar, Randall Balestriero | | 68 0 0 | 17 May 2025
Probing the Vulnerability of Large Language Models to Polysemantic Interventions | Bofan Gong, Shiyang Lai, Dawn Song | AAML MILM | 49 1 0 | 16 May 2025
Are Sparse Autoencoders Useful for Java Function Bug Detection? | Rui Melo, Claudia Mamede, Andre Catarino, Rui Abreu, Henrique Lopes Cardoso | | 81 0 0 | 15 May 2025
Disentangling Polysemantic Channels in Convolutional Neural Networks | Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth | FAtt MILM | 94 0 0 | 17 Apr 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability | Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, ..., Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda | MU | 137 21 0 | 12 Mar 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations | Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison | | 82 1 0 | 25 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features | Bruno Puri, Aakriti Jain, Elena Golimblevskaia, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin | | 199 0 0 | 24 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing | Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda | | 107 13 0 | 23 Feb 2025