Discovering Variable Binding Circuitry with Desiderata

7 July 2023

Tamar Rott Shaham

Papers citing "Discovering Variable Binding Circuitry with Desiderata"

Title | Authors | Tags | Counts | Date
MIB: A Mechanistic Interpretability Benchmark | Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, ..., Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov | - | 43 1 0 | 17 Apr 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification | Vishnu Kabir Chhabra, Ding Zhu, Mohammad Mahdi Khalili | - | 37 2 0 | 27 Feb 2025
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations | Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger | GAN | 28 16 0 | 20 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability | Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, ..., Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov | CML | 52 18 0 | 02 Aug 2024
Philosophy of Cognitive Science in the Age of Deep Learning | Raphaël Millière | AI4CE, NAI | 40 3 0 | 07 May 2024
Mechanistic Interpretability for AI Safety -- A Review | Leonard Bereska, E. Gavves | AI4CE | 40 114 0 | 22 Apr 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations | Jing-ling Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger | - | 59 27 0 | 27 Feb 2024
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking | Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau | - | 49 52 0 | 22 Feb 2024
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models | Alexandre Variengien, Eric Winsor | LRM, ReLM | 76 10 0 | 13 Dec 2023
Grokking Group Multiplication with Cosets | Dashiell Stander, Qinan Yu, Honglu Fan, Stella Biderman | - | 38 9 0 | 11 Dec 2023
Interpretability Illusions in the Generalization of Simplified Models | Dan Friedman, Andrew Kyle Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun | - | 17 14 0 | 06 Dec 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations | Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman | CML | 75 98 0 | 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | - | 212 496 0 | 01 Nov 2022