2210.01892
Cited By
Polysemanticity and Capacity in Neural Networks
4 October 2022
Adam Scherlis
Kshitij Sachan
Adam Jermyn
Joe Benton
Buck Shlegeris
MILM
Papers citing "Polysemanticity and Capacity in Neural Networks"
22 / 22 papers shown
Title
Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking
Yuatyong Chaichana
Thanapat Trachu
Peerat Limkonchotiwat
Konpat Preechakul
Tirasan Khandhawit
Ekapol Chuangsuwanich
MoMe
52
0
0
29 May 2025
A Closer Look at Multimodal Representation Collapse
Abhra Chaudhuri
Anjan Dutta
Tu Bui
Serban Georgescu
18
0
0
28 May 2025
Towards Combinatorial Interpretability of Neural Computation
Micah Adler
Dan Alistarh
Nir Shavit
FAtt
321
2
0
10 Apr 2025
QPM: Discrete Optimization for Globally Interpretable Image Classification
Thomas Norrenbrock
Timo Kaiser
Sovan Biswas
R. Manuvinakurike
Bodo Rosenhahn
117
0
0
27 Feb 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis
Ge Lei
Samuel J. Cooper
KELM
67
0
0
15 Feb 2025
Causal Abstraction in Model Interpretability: A Compact Survey
Yihao Zhang
64
0
0
26 Oct 2024
On the Complexity of Neural Computation in Superposition
Micah Adler
Nir Shavit
160
4
0
05 Sep 2024
Mathematical Models of Computation in Superposition
Kaarel Hänni
Jake Mendel
Dmitry Vaintrob
Lawrence Chan
SupR
61
10
0
10 Aug 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta
Iván Arcuschin
Thomas Kwa
Adrià Garriga-Alonso
83
5
0
19 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
140
31
0
02 Jul 2024
Fine-tuned network relies on generic representation to solve unseen cognitive task
Dongyan Lin
59
0
0
27 Jun 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
92
145
0
22 Apr 2024
Generating Interpretable Networks using Hypernetworks
Isaac Liao
Ziming Liu
Max Tegmark
57
2
0
05 Dec 2023
Measuring Feature Sparsity in Language Models
Mingyang Deng
Lucas Tao
Joe Benton
42
1
0
11 Oct 2023
SPADE: Sparsity-Guided Debugging for Deep Neural Networks
Arshia Soltani Moakhar
Eugenia Iofinova
Elias Frantar
Dan Alistarh
82
2
0
06 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
155
108
0
27 Sep 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
Neel Nanda
Andrew Lee
Martin Wattenberg
FAtt
MILM
78
177
0
02 Sep 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
185
211
0
02 May 2023
Disentangling Neuron Representations with Concept Vectors
Laura O'Mahony
Vincent Andrearczyk
Henning Muller
Mara Graziani
MILM
70
14
0
19 Apr 2023
Tracr: Compiled Transformers as a Laboratory for Interpretability
David Lindner
János Kramár
Sebastian Farquhar
Matthew Rahtz
Tom McGrath
Vladimir Mikulik
68
75
0
12 Jan 2023
Engineering Monosemanticity in Toy Models
Adam Jermyn
Nicholas Schiefer
Evan Hubinger
MILM
48
10
0
16 Nov 2022
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
172
363
0
21 Sep 2022