Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
Tom Henighan
Shauna Kravec
Zac Hatfield-Dodds
Robert Lasenby
Dawn Drain
Carol Chen
Roger C. Grosse
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
Christopher Olah

Abstract
Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as "polysemanticity" that makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
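For readers who want a concrete picture of what such a toy model can look like, below is a minimal sketch in PyTorch: sparse feature vectors are reconstructed through a narrow linear bottleneck with a ReLU readout, so that when features are sparse enough the model is incentivized to store more features than it has hidden dimensions. The abstract does not specify the architecture or hyperparameters; the dimensions, sparsity level, importance weights, and training settings here are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of a toy superposition setup (assumed details, not the
# paper's exact configuration): reconstruct sparse features x through a
# narrow linear map W and a ReLU readout, x_hat = ReLU(W^T W x + b).
import torch

n_features, n_hidden = 20, 5                   # more features than hidden dims
sparsity = 0.95                                # probability a feature is zero
importance = 0.9 ** torch.arange(n_features)   # decaying feature importance

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Sample a batch of sparse feature vectors with values in [0, 1].
    x = torch.rand(1024, n_features)
    x = x * (torch.rand_like(x) > sparsity).float()
    # Compress into n_hidden dimensions, then reconstruct with a ReLU readout.
    x_hat = torch.relu(x @ W.T @ W + b)
    # Importance-weighted reconstruction loss.
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Under this kind of setup, inspecting the columns of `W` after training shows whether the model assigns each important feature its own direction or packs several sparse features into shared, non-orthogonal directions, which is the superposition phenomenon the abstract refers to.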