ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Progress measures for grokking via mechanistic interpretability

12 January 2023
Neel Nanda
Lawrence Chan
Tom Lieberum
Jess Smith
Jacob Steinhardt
arXiv: 2301.05217 (abs / PDF / HTML)

Papers citing "Progress measures for grokking via mechanistic interpretability" (showing 50 of 125)

  • Grokking Explained: A Statistical Phenomenon. B. W. Carvalho, Artur Garcez, Luís C. Lamb, Emílio Vital Brazil. 03 Feb 2025.
  • What is a Number, That a Large Language Model May Know It? Raja Marjieh, Veniamin Veselovsky, Thomas Griffiths, Ilia Sucholutsky. 03 Feb 2025.
  • It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis. Sengim Karayalçin, Marina Krček, Stjepan Picek. 01 Feb 2025.
  • Training Dynamics of In-Context Learning in Linear Attention. Yedi Zhang, Aaditya K. Singh, Peter E. Latham, Andrew Saxe. 27 Jan 2025.
  • Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models. Jaturong Kongmanee. 25 Jan 2025.
  • Physics of Skill Learning. Ziming Liu, Yizhou Liu, Eric J. Michaud, Jeff Gore, Max Tegmark. 21 Jan 2025.
  • Episodic memory in AI agents poses risks that should be studied and mitigated. Chad DeChant. 20 Jan 2025.
  • Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words. Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Y. Matsuo. 09 Jan 2025.
  • Grokking at the Edge of Numerical Stability. Lucas Prieto, Melih Barsbey, Pedro A.M. Mediano, Tolga Birdal. 08 Jan 2025.
  • Out-of-distribution generalization via composition: a lens through induction heads in Transformers. Jiajun Song, Zhuoyan Xu, Yiqiao Zhong. 31 Dec 2024.
  • Tracking the Feature Dynamics in LLM Training: A Mechanistic Study. Yang Xu, Yansen Wang, Hao Wang. 23 Dec 2024.
  • How to explain grokking. S. V. Kozyrev. 17 Dec 2024.
  • A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions. Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z. Ren, Anirudha Majumdar. 07 Dec 2024.
  • An In-depth Investigation of Sparse Rate Reduction in Transformer-like Models. Yunzhe Hu, Difan Zou, Dong Xu. 26 Nov 2024.
  • Interacting Large Language Model Agents. Interpretable Models and Social Learning. Adit Jain, Vikram Krishnamurthy. 02 Nov 2024.
  • Learning and Unlearning of Fabricated Knowledge in Language Models. Chen Sun, Nolan Miller, A. Zhmoginov, Max Vladymyrov, Mark Sandler. 29 Oct 2024.
  • Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. Yaniv Nikankin, Anja Reusch, Aaron Mueller, Yonatan Belinkov. 28 Oct 2024.
  • Identifying Sub-networks in Neural Networks via Functionally Similar Representations. Tian Gao, Amit Dhurandhar, Karthikeyan N. Ramamurthy, Dennis L. Wei. 21 Oct 2024.
  • Analyzing (In)Abilities of SAEs via Formal Languages. Abhinav Menon, Manish Shrivastava, David M. Krueger, Ekdeep Singh Lubana. 15 Oct 2024.
  • Language Models Encode Numbers Using Digit Representations in Base 10. Amit Arnold Levy, Mor Geva. 15 Oct 2024.
  • Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models. Guobin Shen, Dongcheng Zhao, Yiting Dong, Xiang He, Yi Zeng. 03 Oct 2024.
  • Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models. Philipp Mondorf, Sondre Wold, Yun Xue. 02 Oct 2024.
  • Exploring Information-Theoretic Metrics Associated with Neural Collapse in Supervised Training. Kun Song, Zhiquan Tan, Bochao Zou, Jiansheng Chen, Huimin Ma, Weiran Huang. 25 Sep 2024.
  • Language Models "Grok" to Copy. Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan. 14 Sep 2024.
  • Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience. Zhonghao He, Jascha Achterberg, Katie Collins, Kevin K. Nejad, Danyal Akarca, ..., Chole Li, Kai J. Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay. 22 Aug 2024.
  • Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference. Geonhee Kim, Marco Valentino, André Freitas. 16 Aug 2024.
  • Validating Mechanistic Interpretations: An Axiomatic Approach. Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha. 18 Jul 2024.
  • Representing Rule-based Chatbots with Transformers. Dan Friedman, Abhishek Panigrahi, Danqi Chen. 15 Jul 2024.
  • Frequency and Generalisation of Periodic Activation Functions in Reinforcement Learning. Augustine N. Mavor-Parker, Matthew J. Sargent, Caswell Barry, Lewis D. Griffin, Clare Lyle. 09 Jul 2024.
  • Anthropocentric bias in language model evaluation. Raphael Milliere, Charles Rathkopf. 04 Jul 2024.
  • A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao. 02 Jul 2024.
  • Does ChatGPT Have a Mind? Simon Goldstein, B. Levinstein. 27 Jun 2024.
  • Interpreting the Second-Order Effects of Neurons in CLIP. Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt. 06 Jun 2024.
  • LOLAMEME: Logic, Language, Memory, Mechanistic Framework. Jay Desai, Xiaobo Guo, Srinivasan H. Sengamedu. 31 May 2024.
  • Standards for Belief Representations in LLMs. Daniel A. Herrmann, B. Levinstein. 31 May 2024.
  • Survival of the Fittest Representation: A Case Study with Modular Addition. Xiaoman Delores Ding, Zifan Carl Guo, Eric J. Michaud, Ziming Liu, Max Tegmark. 27 May 2024.
  • Bayesian RG Flow in Neural Network Field Theories. Jessica N. Howard, Marc S. Klinger, Anindita Maiti, A. G. Stapleton. 27 May 2024.
  • A rationale from frequency perspective for grokking in training neural network. Zhangchen Zhou, Yaoyu Zhang, Z. Xu. 24 May 2024.
  • How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator. Subhash Kantamneni, Ziming Liu, Max Tegmark. 23 May 2024.
  • Using Degeneracy in the Loss Landscape for Mechanistic Interpretability. Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn. 17 May 2024.
  • KAN: Kolmogorov-Arnold Networks. Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljacic, Thomas Y. Hou, Max Tegmark. 30 Apr 2024.
  • Eigenpruning: an Interpretability-Inspired PEFT Method. Tomás Vergara-Browne, Álvaro Soto, A. Aizawa. 04 Apr 2024.
  • tsGT: Stochastic Time Series Modeling With Transformer. Lukasz Kuciński, Witold Drzewakowski, Mateusz Olko, Piotr Kozakowski, Lukasz Maziarka, Marta Emilia Nowakowska, Lukasz Kaiser, Piotr Miłoś. 08 Mar 2024.
  • Do Large Language Models Latently Perform Multi-Hop Reasoning? Sohee Yang, E. Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel. 26 Feb 2024.
  • Towards Uncovering How Large Language Model Works: An Explainability Perspective. Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Jundong Li. 16 Feb 2024.
  • Black-Box Access is Insufficient for Rigorous AI Audits. Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell. 25 Jan 2024.
  • ALMANACS: A Simulatability Benchmark for Language Model Explainability. Edmund Mills, Shiye Su, Stuart J. Russell, Scott Emmons. 20 Dec 2023.
  • Successor Heads: Recurring, Interpretable Attention Heads In The Wild. Rhys Gould, Euan Ong, George Ogden, Arthur Conmy. 14 Dec 2023.
  • FlexModel: A Framework for Interpretability of Distributed Large Language Models. Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson. 05 Dec 2023.
  • A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia. Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre Kiciman, Hamid Palangi, Barun Patra, Robert West. 04 Dec 2023.