The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks

30 June 2023

Papers citing "The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks"


Understanding In-context Learning of Addition via Activation Subspaces. Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen. 08 May 2025.
Representation Learning on a Random Lattice. Aryeh Brill. 28 Apr 2025.
Low Rank and Sparse Fourier Structure in Recurrent Networks Trained on Modular Addition. Akshay Rangamani. 28 Mar 2025.
Towards Understanding Distilled Reasoning Models: A Representational Approach. David D. Baek, Max Tegmark. 05 Mar 2025.
(How) Do Language Models Track State? Belinda Z. Li, Zifan Carl Guo, Jacob Andreas. 04 Mar 2025.
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis. Ge Lei, Samuel J. Cooper. 15 Feb 2025.
Feature Importance Depends on Properties of the Data: Towards Choosing the Correct Explanations for Your Data and Decision Trees based Models. Célia Wafa Ayad, Thomas Bonnier, Benjamin Bosch, Sonali Parbhoo, Jesse Read. 11 Feb 2025.
Harmonic Loss Trains Interpretable AI Models. David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark. 03 Feb 2025.
Physics of Skill Learning. Ziming Liu, Yizhou Liu, Eric J. Michaud, Jeff Gore, Max Tegmark. 21 Jan 2025.
Out-of-distribution generalization via composition: a lens through induction heads in Transformers. Jiajun Song, Zhuoyan Xu, Yiqiao Zhong. 31 Dec 2024.
Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis. Zeping Yu, Sophia Ananiadou. 21 Sep 2024.
Representing Rule-based Chatbots with Transformers. Dan Friedman, Abhishek Panigrahi, Danqi Chen. 15 Jul 2024.
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao. 02 Jul 2024.
Does ChatGPT Have a Mind? Simon Goldstein, B. Levinstein. 27 Jun 2024.
Standards for Belief Representations in LLMs. Daniel A. Herrmann, B. Levinstein. 31 May 2024.
Survival of the Fittest Representation: A Case Study with Modular Addition. Xiaoman Delores Ding, Zifan Carl Guo, Eric J. Michaud, Ziming Liu, Max Tegmark. 27 May 2024.
KAN: Kolmogorov-Arnold Networks. Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljacic, Thomas Y. Hou, Max Tegmark. 30 Apr 2024.
Opening the AI black box: program synthesis via mechanistic interpretability. Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark. 07 Feb 2024.
Black-Box Access is Insufficient for Rigorous AI Audits. Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell. 25 Jan 2024.
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. Fred Zhang, Neel Nanda. 27 Sep 2023.
It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models. Xingcheng Xu, Zihao Pan, Haipeng Zhang, Yanqing Yang. 16 Aug 2023.
Finding Neurons in a Haystack: Case Studies with Sparse Probing. Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas. 02 May 2023.
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt. 01 Nov 2022.
Omnigrok: Grokking Beyond Algorithmic Data. Ziming Liu, Eric J. Michaud, Max Tegmark. 03 Oct 2022.
In-context Learning and Induction Heads. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah. 24 Sep 2022.
Toy Models of Superposition. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah. 21 Sep 2022.