Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic
Interpretability: A Case Study on Othello-GPT

Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT

19 February 2024

Tianxiang Sun

Xipeng Qiu

ArXiv (abs)PDF HTML

Papers citing "Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT"

15 / 15 papers shown

Title
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning Lefei Zhang Lijie Hu Di Wang LRM 163 4 0 17 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 97 2 0 09 Jan 2025
Residual Stream Analysis with Multi-Layer SAEs Tim Lawson Lucy Farnik Conor Houghton Laurence Aitchison 76 5 0 06 Sep 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 124 23 0 28 May 2024
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 189 112 0 27 Sep 2023
White-Box Transformers via Sparse Rate Reduction Yaodong Yu Sam Buchanan Druv Pai Tianzhe Chu Ziyang Wu Shengbang Tong B. Haeffele Yi Ma ViT 87 86 0 01 Jun 2023
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 316 516 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 183 368 0 21 Sep 2022
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale Tim Dettmers M. Lewis Younes Belkada Luke Zettlemoyer MQ 98 653 0 15 Aug 2022
Locating and Editing Factual Associations in GPT Kevin Meng David Bau A. Andonian Yonatan Belinkov KELM 248 1,357 0 10 Feb 2022
Word Embedding Visualization Via Dictionary Learning Juexiao Zhang Yubei Chen Brian Cheung Bruno A. Olshausen 40 16 0 09 Oct 2019
Deep Learning using Rectified Linear Units (ReLU) Abien Fred Agarap 64 3,229 0 22 Mar 2018
Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning Stefan Elfwing E. Uchibe Kenji Doya 133 1,728 0 10 Feb 2017
Linear Algebraic Structure of Word Senses, with Applications to Polysemy Sanjeev Arora Yuanzhi Li Yingyu Liang Tengyu Ma Andrej Risteski 80 283 0 14 Jan 2016
Distributed Representations of Words and Phrases and their Compositionality Tomas Mikolov Ilya Sutskever Kai Chen G. Corrado J. Dean NAI OCL 397 33,550 0 16 Oct 2013