Transcoders Find Interpretable LLM Feature Circuits

17 June 2024
Jacob Dunefsky, Philippe Chlenski, Neel Nanda

Papers citing "Transcoders Find Interpretable LLM Feature Circuits"

26 papers shown

Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He, J. Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junzhe Zhang, Xipeng Qiu
29 Apr 2025

Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, Blake A. Richards
28 Apr 2025

The Geometry of Self-Verification in a Task-Specific Reasoning Model [LRM]
Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg
19 Apr 2025

Scaling sparse feature circuit finding for in-context learning
Dmitrii Kharlapenko, Shivalika Singh, Fazl Barez, Arthur Conmy, Neel Nanda
18 Apr 2025

Towards Combinatorial Interpretability of Neural Computation [FAtt]
Micah Adler, Dan Alistarh, Nir Shavit
10 Apr 2025

Robustly identifying concepts introduced during chat fine-tuning using crosscoders
Julian Minder, Clement Dumas, Caden Juang, Bilal Chugtai, Neel Nanda
03 Apr 2025

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba
03 Mar 2025

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?
Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard
28 Feb 2025

Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison
25 Feb 2025

Representation in large language models
Cameron C. Yetman
03 Jan 2025

Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju
07 Nov 2024

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, ..., Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu
27 Oct 2024

Mechanistic? [AI4CE]
Naomi Saphra, Sarah Wiegreffe
07 Oct 2024

Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
06 Sep 2024

On the Complexity of Neural Computation in Superposition
Micah Adler, Nir Shavit
05 Sep 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
09 Aug 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability [CML]
Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, ..., Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov
02 Aug 2024

Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, ..., Yong-jia Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
22 Jul 2024

Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna, Sandro Pezzelle, Yonatan Belinkov
26 Mar 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components [KELM]
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
01 Mar 2024

Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu
19 Feb 2024

Finding Neurons in a Haystack: Case Studies with Sparse Probing [MILM]
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
02 May 2023

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model [LRM]
Michael Hanna, Ollie Liu, Alexandre Variengien
30 Apr 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
24 Sep 2022

Toy Models of Superposition [AAML, MILM]
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
21 Sep 2022