Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.05062
Cited By
v1
v2
v3
v4
v5 (latest)
Tracr: Compiled Transformers as a Laboratory for Interpretability
12 January 2023
David Lindner
János Kramár
Sebastian Farquhar
Matthew Rahtz
Tom McGrath
Vladimir Mikulik
Re-assign community
ArXiv (abs)
PDF
HTML
Github (533★)
Papers citing
"Tracr: Compiled Transformers as a Laboratory for Interpretability"
40 / 40 papers shown
Title
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Hao Chen
Haoze Li
Zhiqing Xiao
Lirong Gao
Qi Zhang
Xiaomeng Hu
Ningtao Wang
Xing Fu
Junbo Zhao
187
0
0
24 May 2025
Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning
Jingcheng Niu
Subhabrata Dutta
Ahmed Elshabrawy
Harish Tayyar Madabushi
Iryna Gurevych
131
1
0
16 May 2025
Tracr-Injection: Distilling Algorithms into Pre-trained Language Models
Tomás Vergara-Browne
Álvaro Soto
189
0
0
15 May 2025
Looped ReLU MLPs May Be All You Need as Practical Programmable Computers
Yingyu Liang
Zhizhou Sha
Zhenmei Shi
Zhao Song
Yufa Zhou
165
19
0
21 Feb 2025
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Sihang Li
Yongbin Li
139
10
0
17 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Philipp Mondorf
Sondre Wold
Yun Xue
209
1
0
02 Oct 2024
Representing Rule-based Chatbots with Transformers
Dan Friedman
Abhishek Panigrahi
Danqi Chen
135
1
0
15 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
174
33
0
02 Jul 2024
Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar
Alexander Wettig
Dan Friedman
Danqi Chen
204
20
0
24 Jun 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
279
3
0
28 Dec 2023
Learning Transformer Programs
Dan Friedman
Alexander Wettig
Danqi Chen
69
36
0
01 Jun 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy
Augustine N. Mavor-Parker
Aengus Lynch
Stefan Heimersheim
Adrià Garriga-Alonso
66
319
0
28 Apr 2023
Looped Transformers as Programmable Computers
Angeliki Giannou
Shashank Rajput
Jy-yong Sohn
Kangwook Lee
Jason D. Lee
Dimitris Papailiopoulos
91
106
0
30 Jan 2023
Progress measures for grokking via mechanistic interpretability
Neel Nanda
Lawrence Chan
Tom Lieberum
Jess Smith
Jacob Steinhardt
90
450
0
12 Jan 2023
What learning algorithm is in-context learning? Investigations with linear models
Ekin Akyürek
Dale Schuurmans
Jacob Andreas
Tengyu Ma
Denny Zhou
119
493
0
28 Nov 2022
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
316
563
0
01 Nov 2022
Polysemanticity and Capacity in Neural Networks
Adam Scherlis
Kshitij Sachan
Adam Jermyn
Joe Benton
Buck Shlegeris
MILM
205
32
0
04 Oct 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
323
528
0
24 Sep 2022
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
198
380
0
21 Sep 2022
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Tilman Raukur
A. Ho
Stephen Casper
Dylan Hadfield-Menell
AAML
AI4CE
97
134
0
27 Jul 2022
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
251
1,389
0
10 Feb 2022
Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers
Colin Wei
Yining Chen
Tengyu Ma
59
92
0
28 Jul 2021
Robust Explainability: A Tutorial on Gradient-Based Attribution Methods for Deep Neural Networks
Ian E. Nielsen
Dimah Dera
Ghulam Rasool
N. Bouaynaya
R. Ramachandran
FAtt
71
82
0
23 Jul 2021
Saturated Transformers are Constant-Depth Threshold Circuits
William Merrill
Ashish Sabharwal
Noah A. Smith
100
107
0
30 Jun 2021
Thinking Like Transformers
Gail Weiss
Yoav Goldberg
Eran Yahav
AI4CE
121
135
0
13 Jun 2021
Do Feature Attribution Methods Correctly Attribute Features?
Yilun Zhou
Serena Booth
Marco Tulio Ribeiro
J. Shah
FAtt
XAI
91
135
0
27 Apr 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
303
456
0
24 Feb 2021
Debugging Tests for Model Explanations
Julius Adebayo
M. Muelly
Ilaria Liccardi
Been Kim
FAtt
78
181
0
10 Nov 2020
Towards falsifiable interpretability research
Matthew L. Leavitt
Ari S. Morcos
AAML
AI4CE
80
68
0
22 Oct 2020
Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness?
Alon Jacovi
Yoav Goldberg
XAI
134
600
0
07 Apr 2020
A Primer in BERTology: What we know about how BERT works
Anna Rogers
Olga Kovaleva
Anna Rumshisky
OffRL
101
1,503
0
27 Feb 2020
Benchmarking Attribution Methods with Relative Feature Importance
Mengjiao Yang
Been Kim
FAtt
XAI
73
142
0
23 Jul 2019
Evaluating Explanation Without Ground Truth in Interpretable Machine Learning
Fan Yang
Mengnan Du
Helen Zhou
XAI
ELM
67
67
0
16 Jul 2019
Analysis Methods in Neural Language Processing: A Survey
Yonatan Belinkov
James R. Glass
95
558
0
21 Dec 2018
Deep Learning using Rectified Linear Units (ReLU)
Abien Fred Agarap
81
3,241
0
22 Mar 2018
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
805
132,725
0
12 Jun 2017
Network Dissection: Quantifying Interpretability of Deep Visual Representations
David Bau
Bolei Zhou
A. Khosla
A. Oliva
Antonio Torralba
MILM
FAtt
158
1,526
1
19 Apr 2017
Layer Normalization
Jimmy Lei Ba
J. Kiros
Geoffrey E. Hinton
435
10,541
0
21 Jul 2016
Gaussian Error Linear Units (GELUs)
Dan Hendrycks
Kevin Gimpel
174
5,049
0
27 Jun 2016
Going Deeper with Convolutions
Christian Szegedy
Wei Liu
Yangqing Jia
P. Sermanet
Scott E. Reed
Dragomir Anguelov
D. Erhan
Vincent Vanhoucke
Andrew Rabinovich
496
43,717
0
17 Sep 2014
1