2304.14997
Towards Automated Circuit Discovery for Mechanistic Interpretability
28 April 2023
Arthur Conmy
Augustine N. Mavor-Parker
Aengus Lynch
Stefan Heimersheim
Adrià Garriga-Alonso
Papers citing
"Towards Automated Circuit Discovery for Mechanistic Interpretability"
50 / 76 papers shown
Title
Tracr-Injection: Distilling Algorithms into Pre-trained Language Models
Tomás Vergara-Browne
Álvaro Soto
12
0
0
15 May 2025
Guiding Evolutionary AutoEncoder Training with Activation-Based Pruning Operators
Steven Jorgensen
Erik Hemberg
J. Toutouh
Una-May O’Reilly
49
0
0
08 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
80
1
0
02 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He
J. Wang
Rui Lin
Xuyang Ge
Wentao Shu
Qiong Tang
Junzhe Zhang
Xipeng Qiu
70
0
0
29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Sonia Joseph
Praneet Suresh
Lorenz Hufe
Edward Stevinson
Robert Graham
Yash Vadi
Danilo Bzdok
Sebastian Lapuschkin
Lee Sharkey
Blake A. Richards
72
0
0
28 Apr 2025
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Tyler A. Chang
Benjamin Bergen
50
0
0
21 Apr 2025
Decoding Vision Transformers: the Diffusion Steering Lens
Ryota Takatsuki
Sonia Joseph
Ippei Fujisawa
Ryota Kanai
DiffM
30
0
0
18 Apr 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
43
1
0
17 Apr 2025
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Yixin Cao
Jiahao Ying
Yixuan Wang
Xipeng Qiu
Xuanjing Huang
Yugang Jiang
ELM
41
2
0
10 Apr 2025
Steering off Course: Reliability Challenges in Steering Language Models
Patrick Queiroz Da Silva
Hari Sethuraman
Dheeraj Rajagopal
Hannaneh Hajishirzi
Sachin Kumar
LLMSV
29
1
0
06 Apr 2025
Are formal and functional linguistic mechanisms dissociated in language models?
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
47
0
0
14 Mar 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel D'Oosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
71
0
0
13 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
68
0
0
08 Mar 2025
How can representation dimension dominate structurally pruned LLMs?
Mingxue Xu
Lisa Alazraki
Danilo P. Mandic
56
0
0
06 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Jonathan Jacobi
Gal Niv
LRM
ReLM
60
0
0
03 Mar 2025
Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models
Ruta Binkyte
Ivaxi Sheth
Zhijing Jin
Mohammad Havaei
Bernhard Schölkopf
Mario Fritz
134
0
0
28 Feb 2025
Model Lakes
Koyena Pal
David Bau
Renée J. Miller
67
0
0
24 Feb 2025
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Anton Razzhigaev
Matvey Mikhalchuk
Temurbek Rahmatullaev
Elizaveta Goncharova
Polina Druzhinina
Ivan V. Oseledets
Andrey Kuznetsov
64
2
0
20 Feb 2025
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Shichang Zhang
Tessa Han
Usha Bhalla
Hima Lakkaraju
FAtt
147
0
0
17 Feb 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning
L. Zhang
Lijie Hu
Di Wang
LRM
95
0
0
17 Feb 2025
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
Da Xiao
Qingye Meng
Shengping Li
Xingyuan Yuan
MoE
AI4CE
66
1
0
13 Feb 2025
Modular Training of Neural Networks aids Interpretability
Satvik Golechha
Maheep Chaudhary
Joan Velja
Alessandro Abate
Nandi Schoots
79
0
0
04 Feb 2025
Constrained belief updates explain geometric structures in transformer representations
Mateusz Piotrowski
P. Riechers
Daniel Filan
A. Shai
74
0
0
04 Feb 2025
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference
Go Kamoda
Benjamin Heinzerling
Tatsuro Inaba
Keito Kudo
Keisuke Sakaguchi
Kentaro Inui
MILM
33
0
0
27 Jan 2025
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
Yang Xu
Yixuan Wang
Hao Wang
114
1
0
23 Dec 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu
Sophia Ananiadou
136
0
0
17 Nov 2024
A Mechanistic Explanatory Strategy for XAI
Marcin Rabiza
51
1
0
02 Nov 2024
Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning
John Wu
David Wu
Jimeng Sun
52
1
0
31 Oct 2024
Focus On This, Not That! Steering LLMs With Adaptive Feature Specification
Tom A. Lamb
Adam Davies
Alasdair Paren
Philip H. S. Torr
Francesco Pinto
47
0
0
30 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Fan Zhang
Yongbin Li
59
5
0
17 Oct 2024
Interpreting token compositionality in LLMs: A robustness analysis
Nura Aljaafari
Danilo S. Carvalho
André Freitas
30
0
0
16 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
52
7
0
10 Oct 2024
Unlearning-based Neural Interpretations
Ching Lam Choi
Alexandre Duplessis
Serge Belongie
FAtt
44
0
0
10 Oct 2024
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act
Philipp Guldimann
Alexander Spiridonov
Robin Staab
Nikola Jovanović
Mark Vero
...
Mislav Balunović
Nikola Konstantinov
Pavol Bielik
Petar Tsankov
Martin Vechev
ELM
47
4
0
10 Oct 2024
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Nick Jiang
Anish Kachinthaya
Suzie Petryk
Yossi Gandelsman
VLM
34
15
0
03 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Philipp Mondorf
Sondre Wold
Barbara Plank
34
0
0
02 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
59
2
0
30 Sep 2024
Training Neural Networks for Modularity aids Interpretability
Satvik Golechha
Dylan R. Cope
Nandi Schoots
30
0
0
24 Sep 2024
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
26
3
0
06 Sep 2024
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning
Wei Chen
Zhen Huang
Liang Xie
Binbin Lin
Houqiang Li
...
Deng Cai
Yonggang Zhang
Wenxiao Wang
Xu Shen
Jieping Ye
51
6
0
03 Sep 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs
Nitay Calderon
Roi Reichart
40
10
0
27 Jul 2024
Representing Rule-based Chatbots with Transformers
Dan Friedman
Abhishek Panigrahi
Danqi Chen
63
1
0
15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust
Joseph Miller
Bilal Chughtai
William Saunders
50
7
0
11 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
82
19
0
02 Jul 2024
Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar
Alexander Wettig
Dan Friedman
Danqi Chen
62
17
0
24 Jun 2024
Knowledge Circuits in Pretrained Transformers
Yunzhi Yao
Ningyu Zhang
Zekun Xi
Meng Wang
Ziwen Xu
Shumin Deng
Huajun Chen
KELM
64
20
0
28 May 2024
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks
Jacob Russin
Sam Whitman McGrath
Danielle J. Williams
Lotem Elber-Dorozko
AI4CE
73
3
0
24 May 2024
Emergence of a High-Dimensional Abstraction Phase in Language Transformers
Emily Cheng
Diego Doimo
Corentin Kervadec
Iuri Macocco
Jade Yu
A. Laio
Marco Baroni
112
11
0
24 May 2024
Learned feature representations are biased by complexity, learning order, position, and more
Andrew Kyle Lampinen
Stephanie C. Y. Chan
Katherine Hermann
AI4CE
FaML
SSL
OOD
34
6
0
09 May 2024
What does the Knowledge Neuron Thesis Have to do with Knowledge?
Jingcheng Niu
Andrew Liu
Zining Zhu
Gerald Penn
48
31
0
03 May 2024