A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

6 February 2023
Bilal Chughtai
Lawrence Chan
Neel Nanda

Papers citing "A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations"

50 / 81 papers shown
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
80
1
0
02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde
Louis Jaburi
MILM
86
1
0
01 May 2025
Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model
Zhiwei Xu
Zhiyu Ni
Yixin Wang
Wei Hu
CLL
37
0
0
17 Apr 2025
From Text to Graph: Leveraging Graph Neural Networks for Enhanced Explainability in NLP
Fabio Yáñez-Romero
Andrés Montoyo
Armando Suárez
Yoan Gutiérrez
Ruslan Mitkov
44
0
0
02 Apr 2025
Shared Global and Local Geometry of Language Model Embeddings
Andrew Lee
Melanie Weber
F. Viégas
Martin Wattenberg
FedML
74
1
0
27 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Tianhe Lin
Jian Xie
Siyu Yuan
Deqing Yang
ReLM
LRM
73
2
0
10 Mar 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification
Vishnu Kabir Chhabra
Ding Zhu
Mohammad Mahdi Khalili
37
2
0
27 Feb 2025
Learning the symmetric group: large from small
Max Petschack
Alexandr Garbali
Jan de Gier
AAML
52
0
0
18 Feb 2025
Generative Modeling on Lie Groups via Euclidean Generalized Score Matching
Marco Bertolini
Tuan Le
Djork-Arné Clevert
DiffM
86
0
0
04 Feb 2025
It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis
Sengim Karayalçin
Marina Krček
Stjepan Picek
AAML
75
0
0
01 Feb 2025
Grokking at the Edge of Numerical Stability
Lucas Prieto
Melih Barsbey
Pedro A.M. Mediano
Tolga Birdal
40
3
0
08 Jan 2025
Exploring Grokking: Experimental and Mechanistic Investigations
Hu Qiye
Zhou Hao
Yu RuoXi
71
1
0
14 Dec 2024
Machines and Mathematical Mutations: Using GNNs to Characterize Quiver Mutation Classes
Jesse He
Helen Jenne
Herman Chau
Davis Brown
Mark Raugas
Sara Billey
Henry Kvinge
26
3
0
12 Nov 2024
Tracking Universal Features Through Fine-Tuning and Model Merging
Niels Horn
Desmond Elliott
MoMe
31
0
0
16 Oct 2024
A Theoretical Survey on Foundation Models
Shi Fu
Yuzhu Chen
Yingjie Wang
Dacheng Tao
28
0
0
15 Oct 2024
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures
Junxuan Wang
Xuyang Ge
Wentao Shu
Qiong Tang
Yunhua Zhou
Zhengfu He
Xipeng Qiu
29
7
0
09 Oct 2024
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan
Philip H. S. Torr
Austin Meek
Ashkan Khakzar
David M. Krueger
Fazl Barez
43
10
0
09 Oct 2024
Grokking at the Edge of Linear Separability
Alon Beck
Noam Levi
Yohai Bar-Sinai
31
0
0
06 Oct 2024
Relative Representations: Topological and Geometric Perspectives
Alejandro García-Castellanos
G. Marchetti
Danica Kragic
Martina Scolamiero
48
0
0
17 Sep 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
44
18
0
02 Aug 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang
Yunzhi Yao
Ziwen Xu
Shuofei Qiao
Shumin Deng
...
Yong-jia Jiang
Pengjun Xie
Fei Huang
Huajun Chen
Ningyu Zhang
52
28
0
22 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta
Iván Arcuschin
Thomas Kwa
Adrià Garriga-Alonso
58
3
0
19 Jul 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Nils Palumbo
Ravi Mangal
Zifan Wang
Saranya Vijayakumar
Corina S. Pasareanu
Somesh Jha
41
1
0
18 Jul 2024
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
Karolis Jucys
George Adamopoulos
Mehrab Hamidi
Stephanie Milani
Mohammad Reza Samsami
Artem Zholus
Sonia Joseph
Blake A. Richards
Irina Rish
Özgür Simsek
42
2
0
16 Jul 2024
Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities
Nhat Dinh Minh Le
Ciyue Shen
Chintan Shah
Blake Martin
Daniel Shenker
...
Jennifer A. Hipp
S. Grullon
J. Abel
Harsha Pokkalla
Dinkar Juyal
19
3
0
15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust
Joseph Miller
Bilal Chughtai
William Saunders
50
7
0
11 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
Aaron Mueller
CML
30
10
0
05 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
82
19
0
02 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
26
17
0
25 Jun 2024
MD tree: a model-diagnostic tree grown on loss landscape
Yefan Zhou
Jianlong Chen
Qinxue Cao
Konstantin Schürholt
Yaoqing Yang
31
2
0
24 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
33
1
0
24 Jun 2024
Weight-based Decomposition: A Case for Bilinear MLPs
Michael T. Pearce
Thomas Dooms
Alice Rigg
42
1
0
06 Jun 2024
Pre-trained Large Language Models Use Fourier Features to Compute Addition
Tianyi Zhou
Deqing Fu
Vatsal Sharan
Robin Jia
LRM
34
9
0
05 Jun 2024
From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation
Géraldin Nanfack
Michael Eickenberg
Eugene Belilovsky
FAtt
AAML
GNN
32
0
0
03 Jun 2024
Survival of the Fittest Representation: A Case Study with Modular Addition
Xiaoman Delores Ding
Zifan Carl Guo
Eric J. Michaud
Ziming Liu
Max Tegmark
48
3
0
27 May 2024
Acceleration of Grokking in Learning Arithmetic Operations via Kolmogorov-Arnold Representation
Yeachan Park
Minseok Kim
Yeoneung Kim
29
1
0
26 May 2024
A rationale from frequency perspective for grokking in training neural network
Zhangchen Zhou
Yaoyu Zhang
Z. Xu
40
2
0
24 May 2024
How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator
Subhash Kantamneni
Ziming Liu
Max Tegmark
14
2
0
23 May 2024
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq
Stefan Heimersheim
Nicholas Goldowsky-Dill
Dan Braun
Jake Mendel
Kaarel Hänni
Avery Griffin
Jörn Stöhler
Magdalena Wache
Marius Hobbhahn
FAtt
33
3
0
17 May 2024
Can Language Models Explain Their Own Classification Behavior?
Dane Sherburn
Bilal Chughtai
Owain Evans
42
1
0
13 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
40
112
0
22 Apr 2024
tsGT: Stochastic Time Series Modeling With Transformer
Lukasz Kuciński
Witold Drzewakowski
Mateusz Olko
Piotr Kozakowski
Lukasz Maziarka
Marta Emilia Nowakowska
Lukasz Kaiser
Piotr Miłoś
49
1
0
08 Mar 2024
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition
Yufei Huang
Shengding Hu
Xu Han
Zhiyuan Liu
Maosong Sun
64
14
0
23 Feb 2024
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
Nikhil Prakash
Tamar Rott Shaham
Tal Haklay
Yonatan Belinkov
David Bau
43
52
0
22 Feb 2024
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs
Bilal Chughtai
Alan Cooney
Neel Nanda
HILM
KELM
30
16
0
11 Feb 2024
Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion
Dylan Zhang
Curt Tigges
Zory Zhang
Stella Biderman
Maxim Raginsky
Talia Ringer
24
11
0
23 Jan 2024
From Understanding to Utilization: A Survey on Explainability for Large Language Models
Haoyan Luo
Lucia Specia
48
20
0
23 Jan 2024
Universal Neurons in GPT2 Language Models
Wes Gurnee
Theo Horsley
Zifan Carl Guo
Tara Rezaei Kheirkhah
Qinyi Sun
Will Hathaway
Neel Nanda
Dimitris Bertsimas
MILM
96
37
0
22 Jan 2024
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
Rhys Gould
Euan Ong
George Ogden
Arthur Conmy
LRM
13
44
0
14 Dec 2023
Harmonics of Learning: Universal Fourier Features Emerge in Invariant Networks
G. Marchetti
Christopher Hillar
Danica Kragic
Sophia Sanborn
25
12
0
13 Dec 2023