Interpreting Attention Layer Outputs with Sparse Autoencoders

25 June 2024

Robert Krzyzanowski

Joseph Isaac Bloom

Papers citing "Interpreting Attention Layer Outputs with Sparse Autoencoders"

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models Patrick Leask Neel Nanda Noura Al Moubayed 81 1 0 23 May 2025
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders Bartosz Cywiński Kamil Deja DiffM 122 9 0 29 Jan 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 103 3 0 09 Jan 2025
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders Charles O'Neill David Klindt 173 2 0 20 Nov 2024
Residual Stream Analysis with Multi-Layer SAEs Tim Lawson Lucy Farnik Conor Houghton Laurence Aitchison 78 5 0 06 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 176 33 0 02 Jul 2024
How to use and interpret activation patching Stefan Heimersheim Neel Nanda 80 48 0 23 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 140 159 0 28 Mar 2024
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs Bilal Chughtai Alan Cooney Neel Nanda HILM KELM 70 20 0 11 Feb 2024
Steering Llama 2 via Contrastive Activation Addition Nina Rimsky Nick Gabrieli Julian Schulz Meg Tong Evan Hubinger Alexander Matt Turner LLMSV 59 226 0 09 Dec 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 316 563 0 01 Nov 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 198 380 0 21 Sep 2022
Compositional Explanations of Neurons Jesse Mu Jacob Andreas FAtt CoGe MILM 87 179 0 24 Jun 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer Colin Raffel Noam M. Shazeer Adam Roberts Katherine Lee Sharan Narang Michael Matena Yanqi Zhou Wei Li Peter J. Liu AIMat 506 20,376 0 23 Oct 2019
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned Elena Voita David Talbot F. Moiseev Rico Sennrich Ivan Titov 119 1,149 0 23 May 2019