A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

6 February 2023

Papers citing "A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations"

50 / 81 papers shown

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 80 1 0 02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i Kola Ayonrinde Louis Jaburi MILM 86 1 0 01 May 2025
Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model Zhiwei Xu Zhiyu Ni Yixin Wang Wei Hu CLL 37 0 0 17 Apr 2025
From Text to Graph: Leveraging Graph Neural Networks for Enhanced Explainability in NLP Fabio Yáñez-Romero Andrés Montoyo Armando Suárez Yoan Gutiérrez Ruslan Mitkov 44 0 0 02 Apr 2025
Shared Global and Local Geometry of Language Model Embeddings Andrew Lee Melanie Weber F. Viégas Martin Wattenberg FedML 74 1 0 27 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts Tianhe Lin Jian Xie Siyu Yuan Deqing Yang ReLM LRM 73 2 0 10 Mar 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification Vishnu Kabir Chhabra Ding Zhu Mohammad Mahdi Khalili 37 2 0 27 Feb 2025
Learning the symmetric group: large from small Max Petschack Alexandr Garbali Jan de Gier AAML 52 0 0 18 Feb 2025
Generative Modeling on Lie Groups via Euclidean Generalized Score Matching Marco Bertolini Tuan Le Djork-Arné Clevert DiffM 86 0 0 04 Feb 2025
It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis Sengim Karayalçin Marina Krček Stjepan Picek AAML 75 0 0 01 Feb 2025
Grokking at the Edge of Numerical Stability Lucas Prieto Melih Barsbey Pedro A.M. Mediano Tolga Birdal 40 3 0 08 Jan 2025
Exploring Grokking: Experimental and Mechanistic Investigations Hu Qiye Zhou Hao Yu RuoXi 71 1 0 14 Dec 2024
Machines and Mathematical Mutations: Using GNNs to Characterize Quiver Mutation Classes Jesse He Helen Jenne Herman Chau Davis Brown Mark Raugas Sara Billey Henry Kvinge 26 3 0 12 Nov 2024
Tracking Universal Features Through Fine-Tuning and Model Merging Niels Horn Desmond Elliott MoMe 31 0 0 16 Oct 2024
A Theoretical Survey on Foundation Models Shi Fu Yuzhu Chen Yingjie Wang Dacheng Tao 28 0 0 15 Oct 2024
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures Junxuan Wang Xuyang Ge Wentao Shu Qiong Tang Yunhua Zhou Zhengfu He Xipeng Qiu 29 7 0 09 Oct 2024
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models Michael Lan Philip H. S. Torr Austin Meek Ashkan Khakzar David M. Krueger Fazl Barez 43 10 0 09 Oct 2024
Grokking at the Edge of Linear Separability Alon Beck Noam Levi Yohai Bar-Sinai 31 0 0 06 Oct 2024
Relative Representations: Topological and Geometric Perspectives Alejandro García-Castellanos G. Marchetti Danica Kragic Martina Scolamiero 48 0 0 17 Sep 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 44 18 0 02 Aug 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective Meng Wang Yunzhi Yao Ziwen Xu Shuofei Qiao Shumin Deng ... Yong-jia Jiang Pengjun Xie Fei Huang Huajun Chen Ningyu Zhang 52 28 0 22 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques Rohan Gupta Iván Arcuschin Thomas Kwa Adrià Garriga-Alonso 58 3 0 19 Jul 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach Nils Palumbo Ravi Mangal Zifan Wang Saranya Vijayakumar Corina S. Pasareanu Somesh Jha 41 1 0 18 Jul 2024
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent Karolis Jucys George Adamopoulos Mehrab Hamidi Stephanie Milani Mohammad Reza Samsami Artem Zholus Sonia Joseph Blake A. Richards Irina Rish Özgür Simsek 42 2 0 16 Jul 2024
Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities Nhat Dinh Minh Le Ciyue Shen Chintan Shah Blake Martin Daniel Shenker ... Jennifer A. Hipp S. Grullon J. Abel Harsha Pokkalla Dinkar Juyal 19 3 0 15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust Joseph Miller Bilal Chughtai William Saunders 50 7 0 11 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks Aaron Mueller CML 30 10 0 05 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 82 19 0 02 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders Connor Kissane Robert Krzyzanowski Joseph Isaac Bloom Arthur Conmy Neel Nanda MILM 26 17 0 25 Jun 2024
MD tree: a model-diagnostic tree grown on loss landscape Yefan Zhou Jianlong Chen Qinxue Cao Konstantin Schürholt Yaoqing Yang 31 2 0 24 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation Michal Golovanevsky William Rudman Vedant Palit Ritambhara Singh Carsten Eickhoff 33 1 0 24 Jun 2024
Weight-based Decomposition: A Case for Bilinear MLPs Michael T. Pearce Thomas Dooms Alice Rigg 42 1 0 06 Jun 2024
Pre-trained Large Language Models Use Fourier Features to Compute Addition Tianyi Zhou Deqing Fu Vatsal Sharan Robin Jia LRM 34 9 0 05 Jun 2024
From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation Géraldin Nanfack Michael Eickenberg Eugene Belilovsky FAtt AAML GNN 32 0 0 03 Jun 2024
Survival of the Fittest Representation: A Case Study with Modular Addition Xiaoman Delores Ding Zifan Carl Guo Eric J. Michaud Ziming Liu Max Tegmark 48 3 0 27 May 2024
Acceleration of Grokking in Learning Arithmetic Operations via Kolmogorov-Arnold Representation Yeachan Park Minseok Kim Yeoneung Kim 29 1 0 26 May 2024
A rationale from frequency perspective for grokking in training neural network Zhangchen Zhou Yaoyu Zhang Z. Xu 40 2 0 24 May 2024
How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator Subhash Kantamneni Ziming Liu Max Tegmark 14 2 0 23 May 2024
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks Lucius Bushnaq Stefan Heimersheim Nicholas Goldowsky-Dill Dan Braun Jake Mendel Kaarel Hänni Avery Griffin Jörn Stöhler Magdalena Wache Marius Hobbhahn FAtt 33 3 0 17 May 2024
Can Language Models Explain Their Own Classification Behavior? Dane Sherburn Bilal Chughtai Owain Evans 42 1 0 13 May 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 40 112 0 22 Apr 2024
tsGT: Stochastic Time Series Modeling With Transformer Lukasz Kuciński Witold Drzewakowski Mateusz Olko Piotr Kozakowski Lukasz Maziarka Marta Emilia Nowakowska Lukasz Kaiser Piotr Miłoś 49 1 0 08 Mar 2024
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition Yufei Huang Shengding Hu Xu Han Zhiyuan Liu Maosong Sun 64 14 0 23 Feb 2024
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking Nikhil Prakash Tamar Rott Shaham Tal Haklay Yonatan Belinkov David Bau 43 52 0 22 Feb 2024
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs Bilal Chughtai Alan Cooney Neel Nanda HILM KELM 30 16 0 11 Feb 2024
Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion Dylan Zhang Curt Tigges Zory Zhang Stella Biderman Maxim Raginsky Talia Ringer 24 11 0 23 Jan 2024
From Understanding to Utilization: A Survey on Explainability for Large Language Models Haoyan Luo Lucia Specia 48 20 0 23 Jan 2024
Universal Neurons in GPT2 Language Models Wes Gurnee Theo Horsley Zifan Carl Guo Tara Rezaei Kheirkhah Qinyi Sun Will Hathaway Neel Nanda Dimitris Bertsimas MILM 96 37 0 22 Jan 2024
Successor Heads: Recurring, Interpretable Attention Heads In The Wild Rhys Gould Euan Ong George Ogden Arthur Conmy LRM 13 44 0 14 Dec 2023
Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks G. Marchetti Christopher Hillar Danica Kragic Sophia Sanborn 25 12 0 13 Dec 2023