Residual Stream Analysis with Multi-Layer SAEs

6 September 2024

Papers citing "Residual Stream Analysis with Multi-Layer SAEs"


Title
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations | Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison | 61 0 0 | 25 Feb 2025
Transformer Dynamics: A neuroscientific approach to interpretability of large language models | Jesseba Fernando, Grigori Guitchounts | AI4CE | 41 0 0 | 17 Feb 2025
Steering Language Model Refusal with Sparse Autoencoders | Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangde | LLMSV | 67 10 0 | 18 Nov 2024
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | 212 497 0 | 01 Nov 2022
Disentanglement with Biological Constraints: A Theory of Functional Cell Types | James C. R. Whittington, W. Dorrell, Surya Ganguli, Timothy Edward John Behrens | 47 48 0 | 30 Sep 2022
Toy Models of Superposition | Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah | AAML MILM | 131 322 0 | 21 Sep 2022