Universal Neurons in GPT2 Language Models

22 January 2024
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
arXiv:2401.12181 (MILM)

Papers citing "Universal Neurons in GPT2 Language Models"

20 / 20 papers shown
1. Shared Global and Local Geometry of Language Model Embeddings
   Andrew Lee, Melanie Weber, F. Viégas, Martin Wattenberg
   27 Mar 2025 (FedML)

2. Implicit Reasoning in Transformers is Reasoning through Shortcuts
   Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang
   10 Mar 2025 (ReLM, LRM)

3. Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference
   Go Kamoda, Benjamin Heinzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui
   27 Jan 2025 (MILM)

4. Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
   Michael Lan, Philip H. S. Torr, Austin Meek, Ashkan Khakzar, David M. Krueger, Fazl Barez
   09 Oct 2024

5. Mitigating Copy Bias in In-Context Learning through Neuron Pruning
   Ameen Ali, Lior Wolf, Ivan Titov
   02 Oct 2024

6. On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs
   Nitay Calderon, Roi Reichart
   27 Jul 2024

7. Representing Rule-based Chatbots with Transformers
   Dan Friedman, Abhishek Panigrahi, Danqi Chen
   15 Jul 2024

8. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
   Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
   02 Jul 2024

9. Survival of the Fittest Representation: A Case Study with Modular Addition
   Xiaoman Delores Ding, Zifan Carl Guo, Eric J. Michaud, Ziming Liu, Max Tegmark
   27 May 2024

10. Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
    Alexandre Variengien, Eric Winsor
    13 Dec 2023 (LRM, ReLM)

11. Scheming AIs: Will AIs fake alignment during training in order to get power?
    Joe Carlsmith
    14 Nov 2023

12. On the Expressivity Role of LayerNorm in Transformers' Attention
    Shaked Brody, Shiyu Jin, Xinghao Zhu
    04 May 2023 (MoE)

13. Finding Neurons in a Haystack: Case Studies with Sparse Probing
    Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
    02 May 2023 (MILM)

14. Sparks of Artificial General Intelligence: Early experiments with GPT-4
    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
    22 Mar 2023 (ELM, AI4MH, AI4CE, ALM)

15. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
    01 Nov 2022

16. In-context Learning and Induction Heads
    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
    24 Sep 2022

17. Toy Models of Superposition
    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
    21 Sep 2022 (AAML, MILM)

18. Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
    M. Gwilliam, Abhinav Shrivastava
    16 Jun 2022 (SSL)

19. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
    31 Dec 2020 (AIMat)

20. Towards A Rigorous Science of Interpretable Machine Learning
    Finale Doshi-Velez, Been Kim
    28 Feb 2017 (XAI, FaML)