Polysemanticity and Capacity in Neural Networks

4 October 2022

Papers citing "Polysemanticity and Capacity in Neural Networks"


| Title | Authors | Tags | Counts | Date |
|---|---|---|---|---|
| Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking | Yuatyong Chaichana, Thanapat Trachu, Peerat Limkonchotiwat, Konpat Preechakul, Tirasan Khandhawit, Ekapol Chuangsuwanich | MoMe | 52 0 0 | 29 May 2025 |
| A Closer Look at Multimodal Representation Collapse | Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu | | 18 0 0 | 28 May 2025 |
| Towards Combinatorial Interpretability of Neural Computation | Micah Adler, Dan Alistarh, Nir Shavit | FAtt | 321 2 0 | 10 Apr 2025 |
| QPM: Discrete Optimization for Globally Interpretable Image Classification | Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, R. Manuvinakurike, Bodo Rosenhahn | | 117 0 0 | 27 Feb 2025 |
| The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis | Ge Lei, Samuel J. Cooper | KELM | 67 0 0 | 15 Feb 2025 |
| Causal Abstraction in Model Interpretability: A Compact Survey | Yihao Zhang | | 64 0 0 | 26 Oct 2024 |
| On the Complexity of Neural Computation in Superposition | Micah Adler, Nir Shavit | | 160 4 0 | 05 Sep 2024 |
| Mathematical Models of Computation in Superposition | Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan | SupR | 61 10 0 | 10 Aug 2024 |
| InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques | Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso | | 83 5 0 | 19 Jul 2024 |
| A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models | Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao | | 140 31 0 | 02 Jul 2024 |
| Fine-tuned network relies on generic representation to solve unseen cognitive task | Dongyan Lin | | 59 0 0 | 27 Jun 2024 |
| Mechanistic Interpretability for AI Safety -- A Review | Leonard Bereska, E. Gavves | AI4CE | 92 145 0 | 22 Apr 2024 |
| Generating Interpretable Networks using Hypernetworks | Isaac Liao, Ziming Liu, Max Tegmark | | 57 2 0 | 05 Dec 2023 |
| Measuring Feature Sparsity in Language Models | Mingyang Deng, Lucas Tao, Joe Benton | | 42 1 0 | 11 Oct 2023 |
| SPADE: Sparsity-Guided Debugging for Deep Neural Networks | Arshia Soltani Moakhar, Eugenia Iofinova, Elias Frantar, Dan Alistarh | | 82 2 0 | 06 Oct 2023 |
| Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | Fred Zhang, Neel Nanda | LLMSV | 155 108 0 | 27 Sep 2023 |
| Emergent Linear Representations in World Models of Self-Supervised Sequence Models | Neel Nanda, Andrew Lee, Martin Wattenberg | FAtt, MILM | 78 177 0 | 02 Sep 2023 |
| Finding Neurons in a Haystack: Case Studies with Sparse Probing | Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas | MILM | 185 211 0 | 02 May 2023 |
| Disentangling Neuron Representations with Concept Vectors | Laura O'Mahony, Vincent Andrearczyk, Henning Muller, Mara Graziani | MILM | 70 14 0 | 19 Apr 2023 |
| Tracr: Compiled Transformers as a Laboratory for Interpretability | David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Tom McGrath, Vladimir Mikulik | | 68 75 0 | 12 Jan 2023 |
| Engineering Monosemanticity in Toy Models | Adam Jermyn, Nicholas Schiefer, Evan Hubinger | MILM | 48 10 0 | 16 Nov 2022 |
| Toy Models of Superposition | Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah | AAML, MILM | 172 363 0 | 21 Sep 2022 |