Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
arXiv: 2211.00593
1 November 2022
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
Papers citing "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small" (28 of 128 papers shown)
The mechanistic basis of data dependence and abrupt learning in an in-context classification task. Gautam Reddy. 03 Dec 2023.
Labeling Neural Representations with Inverse Recognition. Kirill Bykov, Laura Kopf, Shinichi Nakajima, Marius Kloft, Marina M.-C. Höhne. 22 Nov 2023.
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks. Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, Hidenori Tanaka. 21 Nov 2023.
Uncovering Intermediate Variables in Transformers using Circuit Probing. Michael A. Lepori, Thomas Serre, Ellie Pavlick. 07 Nov 2023.
Codebook Features: Sparse and Discrete Interpretability for Neural Networks. Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman. 26 Oct 2023.
Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model. Abhijith Chintam, Rahel Beloch, Willem H. Zuidema, Michael Hanna, Oskar van der Wal. 19 Oct 2023.
Interpretable Diffusion via Information Decomposition. Xianghao Kong, Ollie Liu, Han Li, Dani Yogatama, Greg Ver Steeg. 12 Oct 2023.
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers. Anna Langedijk, Hosein Mohebbi, Gabriele Sarti, Willem H. Zuidema, Jaap Jumelet. 05 Oct 2023.
Language Models Represent Space and Time. Wes Gurnee, Max Tegmark. 03 Oct 2023.
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models. Albert Garde, Esben Kran, Fazl Barez. 03 Oct 2023.
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. Fred Zhang, Neel Nanda. 27 Sep 2023.
Sparse Autoencoders Find Highly Interpretable Features in Language Models. Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey. 15 Sep 2023.
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP. Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang. 27 Aug 2023.
PMET: Precise Model Editing in a Transformer. Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, Jie Yu. 17 Aug 2023.
Causal interventions expose implicit situation models for commonsense language understanding. Takateru Yamakoshi, James L. McClelland, A. Goldberg, Robert D. Hawkins. 06 Jun 2023.
Physics of Language Models: Part 1, Learning Hierarchical Language Structures. Zeyuan Allen-Zhu, Yuanzhi Li. 23 May 2023.
Explaining How Transformers Use Context to Build Predictions. Javier Ferrando, Gerard I. Gállego, Ioannis Tsiamas, Marta R. Costa-jussá. 21 May 2023.
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability. Ziming Liu, Eric Gan, Max Tegmark. 04 May 2023.
Computational modeling of semantic change. Nina Tahmasebi, Haim Dubossarsky. 13 Apr 2023.
Localizing Model Behavior with Path Patching. Nicholas W. Goldowsky-Dill, Chris MacLeod, L. Sato, Aryaman Arora. 12 Apr 2023.
Eliciting Latent Predictions from Transformers with the Tuned Lens. Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor V. Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt. 14 Mar 2023.
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations. Bilal Chughtai, Lawrence Chan, Neel Nanda. 06 Feb 2023.
Progress measures for grokking via mechanistic interpretability. Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt. 12 Jan 2023.
Tracr: Compiled Transformers as a Laboratory for Interpretability. David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Tom McGrath, Vladimir Mikulik. 12 Jan 2023.
Transformers Learn Shortcuts to Automata. Bingbin Liu, Jordan T. Ash, Surbhi Goel, A. Krishnamurthy, Cyril Zhang. 19 Oct 2022.
In-context Learning and Induction Heads. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah. 24 Sep 2022.
The Alignment Problem from a Deep Learning Perspective. Richard Ngo, Lawrence Chan, Sören Mindermann. 30 Aug 2022.
Natural Language Descriptions of Deep Visual Features. Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas. 26 Jan 2022.