Enhancing Neural Network Interpretability with Feature-Aligned Sparse
Autoencoders



2 November 2024

David M. Krueger


Papers citing "Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders"


Title | Authors | Tags | Counts | Date
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models | Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao | - | 174 33 0 | 02 Jul 2024
Rigorously Assessing Natural Language Explanations of Neurons | Jing-ling Huang, Atticus Geiger, Karel D'Oosterlinck, Zhengxuan Wu, Christopher Potts | MILM | 75 29 0 | 19 Sep 2023
Interpreting Neural Networks through the Polytope Lens | Sid Black, Lee D. Sharkey, Léo Grinsztajn, Eric Winsor, Daniel A. Braun, ..., Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy | FAtt, MILM | 72 26 0 | 22 Nov 2022
Engineering Monosemanticity in Toy Models | Adam Jermyn, Nicholas Schiefer, Evan Hubinger | MILM | 52 10 0 | 16 Nov 2022
Toy Models of Superposition | Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah | AAML, MILM | 198 380 0 | 21 Sep 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy | AIMat | 476 2,123 0 | 31 Dec 2020
Deep Co-Training for Semi-Supervised Image Recognition | Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, Alan Yuille | - | 64 451 0 | 15 Mar 2018
Fraternal Dropout | Konrad Zolna, Devansh Arpit, Dendi Suhubdy, Yoshua Bengio | - | 52 53 0 | 31 Oct 2017
Deep Mutual Learning | Ying Zhang, Tao Xiang, Timothy M. Hospedales, Huchuan Lu | FedML | 155 1,656 0 | 01 Jun 2017
Network Dissection: Quantifying Interpretability of Deep Visual Representations | David Bau, Bolei Zhou, A. Khosla, A. Oliva, Antonio Torralba | MILM, FAtt | 158 1,526 1 | 19 Apr 2017
Temporal Ensembling for Semi-Supervised Learning | S. Laine, Timo Aila | UQCV | 192 2,570 0 | 07 Oct 2016