2211.09169
Engineering Monosemanticity in Toy Models
16 November 2022
Adam Jermyn
Nicholas Schiefer
Evan Hubinger
MILM
Papers citing "Engineering Monosemanticity in Toy Models"
8 / 8 papers shown
Title
Mixture of Experts Made Intrinsically Interpretable
Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, P. Dokania, Adel Bibi, Philip Torr
MoE | 57 | 0 | 0 | 05 Mar 2025

The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis
Ge Lei, Samuel J. Cooper
KELM | 51 | 0 | 0 | 15 Feb 2025

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Luke Marks, Alasdair Paren, David M. Krueger, Fazl Barez
AAML | 29 | 4 | 0 | 02 Nov 2024

Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska, E. Gavves
AI4CE | 45 | 118 | 0 | 22 Apr 2024

Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
MILM | 165 | 192 | 0 | 02 May 2023

Disentangling Neuron Representations with Concept Vectors
Laura O'Mahony, Vincent Andrearczyk, Henning Muller, Mara Graziani
MILM | 40 | 14 | 0 | 19 Apr 2023

Polysemanticity and Capacity in Neural Networks
Adam Scherlis, Kshitij Sachan, Adam Jermyn, Joe Benton, Buck Shlegeris
MILM | 135 | 25 | 0 | 04 Oct 2022

Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
AAML, MILM | 133 | 326 | 0 | 21 Sep 2022