Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2307.03637
Cited By
Discovering Variable Binding Circuitry with Desiderata
7 July 2023
Xander Davies
Max Nadeau
Nikhil Prakash
Tamar Rott Shaham
David Bau
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Discovering Variable Binding Circuitry with Desiderata"
13 / 13 papers shown
Title
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
43
1
0
17 Apr 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification
Vishnu Kabir Chhabra
Ding Zhu
Mohammad Mahdi Khalili
37
2
0
27 Feb 2025
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
Róbert Csordás
Christopher Potts
Christopher D. Manning
Atticus Geiger
GAN
28
16
0
20 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
52
18
0
02 Aug 2024
Philosophy of Cognitive Science in the Age of Deep Learning
Raphaël Millière
AI4CE
NAI
40
3
0
07 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
40
114
0
22 Apr 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
59
27
0
27 Feb 2024
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
Nikhil Prakash
Tamar Rott Shaham
Tal Haklay
Yonatan Belinkov
David Bau
49
52
0
22 Feb 2024
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien
Eric Winsor
LRM
ReLM
76
10
0
13 Dec 2023
Grokking Group Multiplication with Cosets
Dashiell Stander
Qinan Yu
Honglu Fan
Stella Biderman
38
9
0
11 Dec 2023
Interpretability Illusions in the Generalization of Simplified Models
Dan Friedman
Andrew Kyle Lampinen
Lucas Dixon
Danqi Chen
Asma Ghandeharioun
17
14
0
06 Dec 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger
Zhengxuan Wu
Christopher Potts
Thomas F. Icard
Noah D. Goodman
CML
75
98
0
05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
496
0
01 Nov 2022
1