Engineering Monosemanticity in Toy Models

16 November 2022

Papers citing "Engineering Monosemanticity in Toy Models"


Title
Mixture of Experts Made Intrinsically Interpretable | Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, P. Dokania, Adel Bibi, Philip Torr | MoE | 57 0 0 | 05 Mar 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis | Ge Lei, Samuel J. Cooper | KELM | 51 0 0 | 15 Feb 2025
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders | Luke Marks, Alasdair Paren, David M. Krueger, Fazl Barez | AAML | 27 4 0 | 02 Nov 2024
Mechanistic Interpretability for AI Safety -- A Review | Leonard Bereska, E. Gavves | AI4CE | 45 118 0 | 22 Apr 2024
Finding Neurons in a Haystack: Case Studies with Sparse Probing | Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas | MILM | 165 190 0 | 02 May 2023
Disentangling Neuron Representations with Concept Vectors | Laura O'Mahony, Vincent Andrearczyk, Henning Muller, Mara Graziani | MILM | 34 14 0 | 19 Apr 2023
Polysemanticity and Capacity in Neural Networks | Adam Scherlis, Kshitij Sachan, Adam Jermyn, Joe Benton, Buck Shlegeris | MILM | 135 25 0 | 04 Oct 2022
Toy Models of Superposition | Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah | AAML, MILM | 133 326 0 | 21 Sep 2022