Progress measures for grokking via mechanistic interpretability

12 January 2023

Papers citing "Progress measures for grokking via mechanistic interpretability"

25 / 125 papers shown

Title
Labeling Neural Representations with Inverse Recognition Kirill Bykov Laura Kopf Shinichi Nakajima Marius Kloft Marina M.-C. Höhne BDL 122 20 0 22 Nov 2023
Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks Ben Feuer Chinmay Hegde Niv Cohen 108 11 0 17 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 161 7 0 07 Nov 2023
Explainable Artificial Intelligence (XAI) 2.0: A Manifesto of Open Challenges and Interdisciplinary Research Directions Luca Longo Mario Brcic Federico Cabitza Jaesik Choi Roberto Confalonieri ... Andrés Páez Wojciech Samek Johannes Schneider Timo Speith Simone Stumpf 150 226 0 30 Oct 2023
In-Context Learning Dynamics with Random Binary Sequences Eric J. Bigelow Ekdeep Singh Lubana Robert P. Dick Hidenori Tanaka T. Ullman 92 4 0 26 Oct 2023
Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models Buse Giledereli Jiaoda Li Yu Fei Alessandro Stolfo Wangchunshu Zhou Guangtao Zeng Antoine Bosselut Mrinmaya Sachan LRM 124 47 0 23 Oct 2023
When can transformers reason with abstract symbols? Enric Boix-Adserà Omid Saremi Emmanuel Abbe Samy Bengio Etai Littwin Josh Susskind LRM NAI 66 17 0 15 Oct 2023
Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations Alexa R. Tartaglini Sheridan Feucht Michael A. Lepori Wai Keen Vong Charles Lovering Brenden M. Lake Ellie Pavlick ViT OOD 57 4 0 14 Oct 2023
Measuring Feature Sparsity in Language Models Mingyang Deng Lucas Tao Joe Benton 61 1 0 11 Oct 2023
Phase codes emerge in recurrent neural networks optimized for modular arithmetic Keith T. Murray 25 1 0 11 Oct 2023
Interpreting CLIP's Image Representation via Text-Based Decomposition Yossi Gandelsman Alexei A. Efros Jacob Steinhardt VLM 84 101 0 09 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 205 115 0 27 Sep 2023
Circuit Breaking: Removing Model Behaviors with Targeted Ablation Maximilian Li Xander Davies Max Nadeau KELM MU 75 29 0 12 Sep 2023
FIND: A Function Description Benchmark for Evaluating Interpretability Methods Sarah Schwettmann Tamar Rott Shaham Joanna Materzyńska Neil Chowdhury Shuang Li Jacob Andreas David Bau Antonio Torralba 56 22 0 07 Sep 2023
NeuroSurgeon: A Toolkit for Subnetwork Analysis Michael A. Lepori Ellie Pavlick Thomas Serre 83 7 0 01 Sep 2023
Large Language Models Michael R Douglas LLMAG LM&MA 177 645 0 11 Jul 2023
Towards Regulatable AI Systems: Technical Gaps and Policy Opportunities Xudong Shen H. Brown Jiashu Tao Martin Strobel Yao Tong Akshay Narayan Harold Soh Finale Doshi-Velez 98 3 0 22 Jun 2023
Schema-learning and rebinding as mechanisms of in-context learning and emergence Siva K. Swaminathan Antoine Dedieu Rajkumar Vasudeva Raju Murray Shanahan Miguel Lazaro-Gredilla Dileep George 97 14 0 16 Jun 2023
Adversarial Attacks on the Interpretation of Neuron Activation Maximization Géraldin Nanfack A. Fulleringer Jonathan Marty Michael Eickenberg Eugene Belilovsky AAML FAtt 69 11 0 12 Jun 2023
Birth of a Transformer: A Memory Viewpoint A. Bietti Vivien A. Cabannes Diane Bouchacourt Hervé Jégou Léon Bottou 112 96 0 01 Jun 2023
Physics of Language Models: Part 1, Learning Hierarchical Language Structures Zeyuan Allen-Zhu Yuanzhi Li 112 21 0 23 May 2023
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability Ziming Liu Eric Gan Max Tegmark 82 40 0 04 May 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability Arthur Conmy Augustine N. Mavor-Parker Aengus Lynch Stefan Heimersheim Adrià Garriga-Alonso 68 319 0 28 Apr 2023
Tracr: Compiled Transformers as a Laboratory for Interpretability David Lindner János Kramár Sebastian Farquhar Matthew Rahtz Tom McGrath Vladimir Mikulik 130 75 0 12 Jan 2023
Omnigrok: Grokking Beyond Algorithmic Data Ziming Liu Eric J. Michaud Max Tegmark 115 85 0 03 Oct 2022