2312.03656
Cited By
Interpretability Illusions in the Generalization of Simplified Models
6 December 2023
Dan Friedman
Andrew Kyle Lampinen
Lucas Dixon
Danqi Chen
Asma Ghandeharioun
Papers citing
"Interpretability Illusions in the Generalization of Simplified Models"
6 / 6 papers shown
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
85
1
0
02 May 2025
LangVAE and LangSpace: Building and Probing for Language Model VAEs
Danilo S. Carvalho
Yingji Zhang
Harriet Unsworth
André Freitas
36
0
0
29 Mar 2025
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna
Ollie Liu
Alexandre Variengien
LRM
193
121
0
30 Apr 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
191
266
0
28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
497
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
250
463
0
24 Sep 2022