Towards Automated Circuit Discovery for Mechanistic Interpretability

28 April 2023

Arthur Conmy

Augustine N. Mavor-Parker

Papers citing "Towards Automated Circuit Discovery for Mechanistic Interpretability"

50 / 76 papers shown

Title
Tracr-Injection: Distilling Algorithms into Pre-trained Language Models Tomás Vergara-Browne Álvaro Soto 12 0 0 15 May 2025
Guiding Evolutionary AutoEncoder Training with Activation-Based Pruning Operators Steven Jorgensen Erik Hemberg J. Toutouh Una-May O’Reilly 49 0 0 08 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 80 1 0 02 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang Junzhe Zhang Xipeng Qiu 70 0 0 29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video Sonia Joseph Praneet Suresh Lorenz Hufe Edward Stevinson Robert Graham Yash Vadi Danilo Bzdok Sebastian Lapuschkin Lee Sharkey Blake A. Richards 72 0 0 28 Apr 2025
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models Tyler A. Chang Benjamin Bergen 50 0 0 21 Apr 2025
Decoding Vision Transformers: the Diffusion Steering Lens Ryota Takatsuki Sonia Joseph Ippei Fujisawa Ryota Kanai DiffM 30 0 0 18 Apr 2025
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 43 1 0 17 Apr 2025
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric Yixin Cao Jiahao Ying Yixuan Wang Xipeng Qiu Xuanjing Huang Yugang Jiang ELM 41 2 0 10 Apr 2025
Steering off Course: Reliability Challenges in Steering Language Models Patrick Queiroz Da Silva Hari Sethuraman Dheeraj Rajagopal Hannaneh Hajishirzi Sachin Kumar LLMSV 29 1 0 06 Apr 2025
Are formal and functional linguistic mechanisms dissociated in language models? Michael Hanna Sandro Pezzelle Yonatan Belinkov 47 0 0 14 Mar 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel DÓosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 71 0 0 13 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models Thomas Winninger Boussad Addad Katarzyna Kapusta AAML 68 0 0 08 Mar 2025
How can representation dimension dominate structurally pruned LLMs? Mingxue Xu Lisa Alazraki Danilo P. Mandic 56 0 0 06 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation Jonathan Jacobi Gal Niv LRM ReLM 60 0 0 03 Mar 2025
Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models Ruta Binkyte Ivaxi Sheth Zhijing Jin Mohammad Havaei Bernhard Schölkopf Mario Fritz 134 0 0 28 Feb 2025
Model Lakes Koyena Pal David Bau Renée J. Miller 67 0 0 24 Feb 2025
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers Anton Razzhigaev Matvey Mikhalchuk Temurbek Rahmatullaev Elizaveta Goncharova Polina Druzhinina Ivan V. Oseledets Andrey Kuznetsov 64 2 0 20 Feb 2025
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution Shichang Zhang Tessa Han Usha Bhalla Hima Lakkaraju FAtt 147 0 0 17 Feb 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning L. Zhang Lijie Hu Di Wang LRM 95 0 0 17 Feb 2025
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections Da Xiao Qingye Meng Shengping Li Xingyuan Yuan MoE AI4CE 66 1 0 13 Feb 2025
Modular Training of Neural Networks aids Interpretability Satvik Golechha Maheep Chaudhary Joan Velja Alessandro Abate Nandi Schoots 79 0 0 04 Feb 2025
Constrained belief updates explain geometric structures in transformer representations Mateusz Piotrowski P. Riechers Daniel Filan A. Shai 74 0 0 04 Feb 2025
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference Go Kamoda Benjamin Heinzerling Tatsuro Inaba Keito Kudo Keisuke Sakaguchi Kentaro Inui MILM 33 0 0 27 Jan 2025
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study Yang Xu Yixuan Wang Hao Wang 114 1 0 23 Dec 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering Zeping Yu Sophia Ananiadou 136 0 0 17 Nov 2024
A Mechanistic Explanatory Strategy for XAI Marcin Rabiza 51 1 0 02 Nov 2024
Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning John Wu David Wu Jimeng Sun 52 1 0 31 Oct 2024
Focus On This, Not That! Steering LLMs With Adaptive Feature Specification Tom A. Lamb Adam Davies Alasdair Paren Philip H. S. Torr Francesco Pinto 47 0 0 30 Oct 2024
On the Role of Attention Heads in Large Language Model Safety Zhenhong Zhou Haiyang Yu Xinghua Zhang Rongwu Xu Fei Huang Kun Wang Yang Liu Fan Zhang Yongbin Li 59 5 0 17 Oct 2024
Interpreting token compositionality in LLMs: A robustness analysis Nura Aljaafari Danilo S. Carvalho André Freitas 30 0 0 16 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure Yuxiao Li Eric J. Michaud David D. Baek Joshua Engels Xiaoqing Sun Max Tegmark 52 7 0 10 Oct 2024
Unlearning-based Neural Interpretations Ching Lam Choi Alexandre Duplessis Serge Belongie FAtt 44 0 0 10 Oct 2024
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act Philipp Guldimann Alexander Spiridonov Robin Staab Nikola Jovanović Mark Vero ... Mislav Balunović Nikola Konstantinov Pavol Bielik Petar Tsankov Martin Vechev ELM 47 4 0 10 Oct 2024
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations Nick Jiang Anish Kachinthaya Suzie Petryk Yossi Gandelsman VLM 34 15 0 03 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models Philipp Mondorf Sondre Wold Barbara Plank 34 0 0 02 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 59 2 0 30 Sep 2024
Training Neural Networks for Modularity aids Interpretability Satvik Golechha Dylan R. Cope Nandi Schoots 30 0 0 24 Sep 2024
Residual Stream Analysis with Multi-Layer SAEs Tim Lawson Lucy Farnik Conor Houghton Laurence Aitchison 26 3 0 06 Sep 2024
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning Wei Chen Zhen Huang Liang Xie Binbin Lin Houqiang Li ... Deng Cai Yonggang Zhang Wenxiao Wang Xu Shen Jieping Ye 51 6 0 03 Sep 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs Nitay Calderon Roi Reichart 40 10 0 27 Jul 2024
Representing Rule-based Chatbots with Transformers Dan Friedman Abhishek Panigrahi Danqi Chen 63 1 0 15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust Joseph Miller Bilal Chughtai William Saunders 50 7 0 11 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 82 19 0 02 Jul 2024
Finding Transformer Circuits with Edge Pruning Adithya Bhaskar Alexander Wettig Dan Friedman Danqi Chen 62 17 0 24 Jun 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 64 20 0 28 May 2024
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks Jacob Russin Sam Whitman McGrath Danielle J. Williams Lotem Elber-Dorozko AI4CE 73 3 0 24 May 2024
Emergence of a High-Dimensional Abstraction Phase in Language Transformers Emily Cheng Diego Doimo Corentin Kervadec Iuri Macocco Jade Yu A. Laio Marco Baroni 112 11 0 24 May 2024
Learned feature representations are biased by complexity, learning order, position, and more Andrew Kyle Lampinen Stephanie C. Y. Chan Katherine Hermann AI4CE FaML SSL OOD 34 6 0 09 May 2024
What does the Knowledge Neuron Thesis Have to do with Knowledge? Jingcheng Niu Andrew Liu Zining Zhu Gerald Penn 48 31 0 03 May 2024