ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2211.00593
  4. Cited By
Interpretability in the Wild: a Circuit for Indirect Object
  Identification in GPT-2 small

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

1 November 2022
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
ArXivPDFHTML

Papers citing "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small"

28 / 128 papers shown
Title
The mechanistic basis of data dependence and abrupt learning in an
  in-context classification task
The mechanistic basis of data dependence and abrupt learning in an in-context classification task
Gautam Reddy
29
53
0
03 Dec 2023
Labeling Neural Representations with Inverse Recognition
Labeling Neural Representations with Inverse Recognition
Kirill Bykov
Laura Kopf
Shinichi Nakajima
Marius Kloft
Marina M.-C. Höhne
BDL
46
15
0
22 Nov 2023
Compositional Capabilities of Autoregressive Transformers: A Study on
  Synthetic, Interpretable Tasks
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
Rahul Ramesh
Ekdeep Singh Lubana
Mikail Khona
Robert P. Dick
Hidenori Tanaka
CoGe
39
8
0
21 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori
Thomas Serre
Ellie Pavlick
78
7
0
07 Nov 2023
Codebook Features: Sparse and Discrete Interpretability for Neural
  Networks
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Alex Tamkin
Mohammad Taufeeque
Noah D. Goodman
40
27
0
26 Oct 2023
Identifying and Adapting Transformer-Components Responsible for Gender
  Bias in an English Language Model
Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model
Abhijith Chintam
Rahel Beloch
Willem H. Zuidema
Michael Hanna
Oskar van der Wal
30
16
0
19 Oct 2023
Interpretable Diffusion via Information Decomposition
Interpretable Diffusion via Information Decomposition
Xianghao Kong
Ollie Liu
Han Li
Dani Yogatama
Greg Ver Steeg
29
21
0
12 Oct 2023
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
Anna Langedijk
Hosein Mohebbi
Gabriele Sarti
Willem H. Zuidema
Jaap Jumelet
37
10
0
05 Oct 2023
Language Models Represent Space and Time
Language Models Represent Space and Time
Wes Gurnee
Max Tegmark
54
142
0
03 Oct 2023
DeepDecipher: Accessing and Investigating Neuron Activation in Large
  Language Models
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
Albert Garde
Esben Kran
Fazl Barez
26
2
0
03 Oct 2023
Towards Best Practices of Activation Patching in Language Models:
  Metrics and Methods
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
41
101
0
27 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language
  Models
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
35
347
0
15 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing
  Tool for BLIP
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit
Rohan Pandey
Aryaman Arora
Paul Pu Liang
34
20
0
27 Aug 2023
PMET: Precise Model Editing in a Transformer
PMET: Precise Model Editing in a Transformer
Xiaopeng Li
Shasha Li
Shezheng Song
Jing Yang
Jun Ma
Jie Yu
KELM
39
119
0
17 Aug 2023
Causal interventions expose implicit situation models for commonsense
  language understanding
Causal interventions expose implicit situation models for commonsense language understanding
Takateru Yamakoshi
James L. McClelland
A. Goldberg
Robert D. Hawkins
37
6
0
06 Jun 2023
Physics of Language Models: Part 1, Learning Hierarchical Language Structures
Physics of Language Models: Part 1, Learning Hierarchical Language Structures
Zeyuan Allen-Zhu
Yuanzhi Li
40
17
0
23 May 2023
Explaining How Transformers Use Context to Build Predictions
Explaining How Transformers Use Context to Build Predictions
Javier Ferrando
Gerard I. Gállego
Ioannis Tsiamas
Marta R. Costa-jussá
37
32
0
21 May 2023
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic
  Interpretability
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
Ziming Liu
Eric Gan
Max Tegmark
26
36
0
04 May 2023
Computational modeling of semantic change
Computational modeling of semantic change
Nina Tahmasebi
Haim Dubossarsky
43
6
0
13 Apr 2023
Localizing Model Behavior with Path Patching
Localizing Model Behavior with Path Patching
Nicholas W. Goldowsky-Dill
Chris MacLeod
L. Sato
Aryaman Arora
42
85
0
12 Apr 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose
Zach Furman
Logan Smith
Danny Halawi
Igor V. Ostrovsky
Lev McKinney
Stella Biderman
Jacob Steinhardt
27
196
0
14 Mar 2023
A Toy Model of Universality: Reverse Engineering How Networks Learn
  Group Operations
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
Bilal Chughtai
Lawrence Chan
Neel Nanda
21
96
0
06 Feb 2023
Progress measures for grokking via mechanistic interpretability
Progress measures for grokking via mechanistic interpretability
Neel Nanda
Lawrence Chan
Tom Lieberum
Jess Smith
Jacob Steinhardt
49
386
0
12 Jan 2023
Tracr: Compiled Transformers as a Laboratory for Interpretability
Tracr: Compiled Transformers as a Laboratory for Interpretability
David Lindner
János Kramár
Sebastian Farquhar
Matthew Rahtz
Tom McGrath
Vladimir Mikulik
39
72
0
12 Jan 2023
Transformers Learn Shortcuts to Automata
Transformers Learn Shortcuts to Automata
Bingbin Liu
Jordan T. Ash
Surbhi Goel
A. Krishnamurthy
Cyril Zhang
OffRL
LRM
53
158
0
19 Oct 2022
In-context Learning and Induction Heads
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
252
476
0
24 Sep 2022
The Alignment Problem from a Deep Learning Perspective
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
68
183
0
30 Aug 2022
Natural Language Descriptions of Deep Visual Features
Natural Language Descriptions of Deep Visual Features
Evan Hernandez
Sarah Schwettmann
David Bau
Teona Bagashvili
Antonio Torralba
Jacob Andreas
MILM
206
117
0
26 Jan 2022
Previous
123