Towards Automated Circuit Discovery for Mechanistic Interpretability

28 April 2023

Arthur Conmy

Augustine N. Mavor-Parker

Papers citing "Towards Automated Circuit Discovery for Mechanistic Interpretability"

27 / 77 papers shown

Title
A Multimodal Automated Interpretability Agent Tamar Rott Shaham Sarah Schwettmann Franklin Wang Achyuta Rajaram Evan Hernandez Jacob Andreas Antonio Torralba 34 17 0 22 Apr 2024
Does Transformer Interpretability Transfer to RNNs? Gonccalo Paulo Thomas Marshall Nora Belrose 63 6 0 09 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 46 115 0 28 Mar 2024
The Garden of Forking Paths: Observing Dynamic Parameters Distribution in Large Language Models Carlo Nicolini Jacopo Staiano Bruno Lepri Raffaele Marino MoE 34 1 0 13 Mar 2024
Opening the AI black box: program synthesis via mechanistic interpretability Eric J. Michaud Isaac Liao Vedang Lad Ziming Liu Anish Mudide Chloe Loughridge Zifan Carl Guo Tara Rezaei Kheirkhah Mateja Vukelić Max Tegmark 23 12 0 07 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 34 87 0 11 Jan 2024
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 75 7 0 07 Nov 2023
Identifying Interpretable Visual Features in Artificial and Biological Neural Systems David A. Klindt Sophia Sanborn Francisco Acosta Frédéric Poitevin Nina Miolane MILM FAtt 44 7 0 17 Oct 2023
Interpretable Diffusion via Information Decomposition Xianghao Kong Ollie Liu Han Li Dani Yogatama Greg Ver Steeg 24 20 0 12 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 36 97 0 27 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models Hoagy Cunningham Aidan Ewart Logan Riggs R. Huben Lee Sharkey MILM 33 335 0 15 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP Vedant Palit Rohan Pandey Aryaman Arora Paul Pu Liang 34 20 0 27 Aug 2023
Explaining black box text modules in natural language with language models Chandan Singh Aliyah R. Hsu Richard Antonello Shailee Jain Alexander G. Huth Bin-Xia Yu Jianfeng Gao MILM 34 47 0 17 May 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 160 188 0 02 May 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 189 120 0 30 Apr 2023
Computational modeling of semantic change Nina Tahmasebi Haim Dubossarsky 34 6 0 13 Apr 2023
Localizing Model Behavior with Path Patching Nicholas W. Goldowsky-Dill Chris MacLeod L. Sato Aryaman Arora 31 85 0 12 Apr 2023
Language Model Crossover: Variation through Few-Shot Prompting Elliot Meyerson M. Nelson Herbie Bradley Adam Gaier Arash Moradi Amy K. Hoover Joel Lehman VLM 31 79 0 23 Feb 2023
Tracr: Compiled Transformers as a Laboratory for Interpretability David Lindner János Kramár Sebastian Farquhar Matthew Rahtz Tom McGrath Vladimir Mikulik 29 72 0 12 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 212 496 0 01 Nov 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 250 460 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 125 318 0 21 Sep 2022
Natural Language Descriptions of Deep Visual Features Evan Hernandez Sarah Schwettmann David Bau Teona Bagashvili Antonio Torralba Jacob Andreas MILM 204 117 0 26 Jan 2022
Causal Distillation for Language Models Zhengxuan Wu Atticus Geiger J. Rozner Elisa Kreiss Hanson Lu Thomas F. Icard Christopher Potts Noah D. Goodman 89 25 0 05 Dec 2021
A Survey on Neural Network Interpretability Yu Zhang Peter Tiño A. Leonardis K. Tang FaML XAI 144 661 0 28 Dec 2020
Scaling Laws for Neural Language Models Jared Kaplan Sam McCandlish T. Henighan Tom B. Brown B. Chess R. Child Scott Gray Alec Radford Jeff Wu Dario Amodei 261 4,489 0 23 Jan 2020
Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez Been Kim XAI FaML 257 3,684 0 28 Feb 2017