2301.05217
Progress measures for grokking via mechanistic interpretability
12 January 2023
Neel Nanda
Lawrence Chan
Tom Lieberum
Jess Smith
Jacob Steinhardt
Papers citing
"Progress measures for grokking via mechanistic interpretability"
50 / 125 papers shown
Title
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers
Jingtong Su
Julia Kempe
Karen Ullrich
14
0
0
20 Jun 2025
Hidden Breakthroughs in Language Model Training
Sara Kangaslahti
Elan Rosenfeld
Naomi Saphra
26
0
0
18 Jun 2025
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Laura Kopf
Nils Feldhus
Kirill Bykov
P. Bommer
Anna Hedström
Marina M.-C. Höhne
Oliver Eberle
26
0
0
18 Jun 2025
Distinct Computations Emerge From Compositional Curricula in In-Context Learning
Jin Hwa Lee
Andrew Kyle Lampinen
Aaditya K. Singh
Andrew Saxe
25
0
0
16 Jun 2025
GrokAlign: Geometric Characterisation and Acceleration of Grokking
Thomas Walker
Ahmed Imtiaz Humayun
Randall Balestriero
Richard G. Baraniuk
32
0
0
14 Jun 2025
Model Organisms for Emergent Misalignment
Edward Turner
Anna Soligo
Mia Taylor
Senthooran Rajamanoharan
Neel Nanda
20
1
0
13 Jun 2025
Scaling Laws for Uncertainty in Deep Learning
Mattia Rosso
Simone Rossi
Giulio Franzese
Markus Heinonen
Maurizio Filippone
BDL
UQCV
92
0
0
11 Jun 2025
Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban
Mohammad Taufeeque
Aaron David Tucker
Adam Gleave
Adrià Garriga-Alonso
40
0
0
11 Jun 2025
On Finetuning Tabular Foundation Models
Ivan Rubachev
Akim Kotelnikov
Nikolay Kartashev
Artem Babenko
29
0
0
10 Jun 2025
Enhancing Accuracy and Maintainability in Nuclear Plant Data Retrieval: A Function-Calling LLM Approach Over NL-to-SQL
Mishca de Costa
Muhammad Anwar
Dave Mercier
Mark Randall
Issam Hammad
26
0
0
10 Jun 2025
Addition in Four Movements: Mapping Layer-wise Information Trajectories in LLMs
Yao Yan
15
0
0
09 Jun 2025
Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs
Roy Eisenstadt
Itamar Zimerman
Lior Wolf
LRM
15
0
0
08 Jun 2025
Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks
D. Kunin
Giovanni Luca Marchetti
F. Chen
Dhruva Karkada
James B. Simon
M. DeWeese
Surya Ganguli
Nina Miolane
28
0
0
06 Jun 2025
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee
Aeree Cho
Grace C. Kim
ShengYun Peng
Mansi Phute
Duen Horng Chau
LM&MA
AI4CE
72
0
0
05 Jun 2025
Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Bhavik Chandna
Zubair Bashir
Procheta Sen
87
0
0
05 Jun 2025
Bridging Neural ODE and ResNet: A Formal Error Bound for Safety Verification
Abdelrahman Sayed Sayed
Pierre-Jean Meyer
Mohamed Ghazel
29
0
0
03 Jun 2025
Tug-of-war between idiom's figurative and literal meanings in LLMs
Soyoung Oh
Xinting Huang
Mathis Pink
Michael Hahn
Vera Demberg
64
0
0
02 Jun 2025
Circuit Stability Characterizes Language Model Generalization
Alan Sun
LRM
27
0
0
30 May 2025
Continuous Chain of Thought Enables Parallel Exploration and Reasoning
Halil Alperen Gozeten
M. E. Ildiz
Xuechen Zhang
Hrayr Harutyunyan
A. S. Rawat
Samet Oymak
LRM
69
0
0
29 May 2025
Characterising the Inductive Biases of Neural Networks on Boolean Data
Chris Mingard
Lukas Seier
Niclas Goring
Andrei-Vlad Badelita
Charles London
Ard A. Louis
AI4CE
39
0
0
29 May 2025
How Do Transformers Learn Variable Binding in Symbolic Programs?
Yiwei Wu
Atticus Geiger
Raphaël Millière
NAI
30
1
0
27 May 2025
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Stanley Yu
Vaidehi Bulusu
Oscar Yasunaga
Clayton Lau
Cole Blondin
Sean O'Brien
Kevin Zhu
Vasu Sharma
54
0
0
27 May 2025
Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?
Yongjie Wang
Yibo Wang
Xin Zhou
Zhiqi Shen
62
0
0
24 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Patrick Leask
Neel Nanda
Noura Al Moubayed
87
1
0
23 May 2025
The emergence of sparse attention: impact of data distribution and benefits of repetition
Nicolas Zucchet
Francesco D'Angelo
Andrew Kyle Lampinen
Stephanie C. Y. Chan
214
1
0
23 May 2025
Mechanistic evaluation of Transformers and state space models
Aryaman Arora
Neil Rathi
Nikil Roashan Selvam
Róbert Csordás
Dan Jurafsky
Christopher Potts
114
1
0
21 May 2025
Beyond the Black Box: Interpretability of LLMs in Finance
Hariom Tatsat
Ariye Shater
AIFin
66
0
0
14 May 2025
Understanding In-context Learning of Addition via Activation Subspaces
Xinyan Hu
Kayo Yin
Michael I. Jordan
Jacob Steinhardt
Lijie Chen
145
2
0
08 May 2025
Contextures: Representations from Contexts
Runtian Zhai
Kai Yang
Che-Ping Tsai
Burak Varici
Zico Kolter
Pradeep Ravikumar
447
0
0
02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde
Louis Jaburi
MILM
175
1
0
01 May 2025
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Yiping Wang
Qing Yang
Zhiyuan Zeng
Liliang Ren
Liu Liu
...
Jianfeng Gao
Weizhu Chen
Shuaiqiang Wang
Simon Shaolei Du
Yelong Shen
OffRL
ReLM
LRM
332
47
0
29 Apr 2025
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
Roman Abramov
Felix Steinbauer
Gjergji Kasneci
467
0
0
29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Sonia Joseph
Praneet Suresh
Lorenz Hufe
Edward Stevinson
Robert Graham
Yash Vadi
Danilo Bzdok
Sebastian Lapuschkin
Lee Sharkey
Blake A. Richards
151
0
0
28 Apr 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
116
2
0
17 Apr 2025
Towards Combinatorial Interpretability of Neural Computation
Micah Adler
Dan Alistarh
Nir Shavit
FAtt
393
2
0
10 Apr 2025
Capturing AI's Attention: Physics of Repetition, Hallucination, Bias and Beyond
Frank Yingjie Huo
Neil F. Johnson
126
1
0
06 Apr 2025
Beyond the Next Token: Towards Prompt-Robust Zero-Shot Classification via Efficient Multi-Token Prediction
Junlang Qian
Zixiao Zhu
Hanzhang Zhou
Zijian Feng
Zepeng Zhai
K. Mao
AAML
VLM
125
0
0
04 Apr 2025
LLM Social Simulations Are a Promising Research Method
Jacy Reese Anthis
Ryan Liu
Sean M. Richardson
Austin C. Kozlowski
Bernard Koch
James A. Evans
Erik Brynjolfsson
Michael S. Bernstein
ALM
111
15
0
03 Apr 2025
Low Rank and Sparse Fourier Structure in Recurrent Networks Trained on Modular Addition
Akshay Rangamani
103
0
0
28 Mar 2025
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
Philip Quirke
Clement Neo
Abir Harrasse
Dhruv Nathawani
Luke Marks
Amir Abdullah
86
0
0
17 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Tianhe Lin
Jian Xie
Siyu Yuan
Deqing Yang
ReLM
LRM
153
3
0
10 Mar 2025
Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)
Yoonsoo Nam
Seok Hyeong Lee
Clementine Domine
Yea Chan Park
Charles London
Wonyl Choi
Niclas Goring
Seungjai Lee
AI4CE
209
1
0
28 Feb 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification
Vishnu Kabir Chhabra
Ding Zhu
Mohammad Mahdi Khalili
99
2
0
27 Feb 2025
Do Multilingual LLMs Think In English?
Lisa Schut
Y. Gal
Sebastian Farquhar
90
15
0
24 Feb 2025
Entropy-Lens: The Information Signature of Transformer Computations
Riccardo Ali
Francesco Caso
Christopher Irwin
Pietro Lio
106
3
0
23 Feb 2025
An explainable transformer circuit for compositional generalization
Cheng Tang
Brenden Lake
Mehrdad Jazayeri
LRM
149
3
0
19 Feb 2025
Hyper-SET: Designing Transformers via Hyperspherical Energy Minimization
Yunzhe Hu
Difan Zou
Dong Xu
157
1
0
17 Feb 2025
Early Stopping Against Label Noise Without Validation Data
Suqin Yuan
Lei Feng
Tongliang Liu
NoLa
279
19
0
11 Feb 2025
Constrained belief updates explain geometric structures in transformer representations
Mateusz Piotrowski
P. Riechers
Daniel Filan
A. Shai
130
2
0
04 Feb 2025
Modular Training of Neural Networks aids Interpretability
Satvik Golechha
Maheep Chaudhary
Joan Velja
Alessandro Abate
Nandi Schoots
153
0
0
04 Feb 2025