How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability

7 May 2024

Papers citing "How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability"

Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability Vishnu Kabir Chhabra Mohammad Mahdi Khalili AI4CE 33 0 0 05 Apr 2025
Investigating Neurons and Heads in Transformer-based LLMs for Typographical Errors Kohei Tsuji Tatsuya Hiraoka Yuchang Cheng Eiji Aramaki Tomoya Iwakura 79 0 0 27 Feb 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification Vishnu Kabir Chhabra Ding Zhu Mohammad Mahdi Khalili 45 2 0 27 Feb 2025
Residual Stream Analysis with Multi-Layer SAEs Tim Lawson Lucy Farnik Conor Houghton Laurence Aitchison 31 3 0 06 Sep 2024
Attention Heads of Large Language Models: A Survey Zifan Zheng Yezhaohui Wang Yuxin Huang Shichao Song Mingchuan Yang Bo Tang Zhiyu Li LRM 58 22 0 05 Sep 2024
A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models Geonhee Kim Marco Valentino André Freitas LRM AI4CE 30 7 0 16 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 52 18 0 02 Aug 2024
Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability Jorge García-Carrasco A. Maté Juan Trujillo AAML 31 3 0 29 Jul 2024
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 193 121 0 30 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 212 497 0 01 Nov 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 131 322 0 21 Sep 2022