arXiv: 2402.03855
Challenges in Mechanistically Interpreting Model Representations
6 February 2024
Satvik Golechha, James Dao
Papers citing "Challenges in Mechanistically Interpreting Model Representations" (7 of 7 papers shown)
Modular Training of Neural Networks aids Interpretability
Satvik Golechha, Maheep Chaudhary, Joan Velja, Alessandro Abate, Nandi Schoots
04 Feb 2025
Locking Down the Finetuned LLMs Safety
Minjun Zhu, Linyi Yang, Yifan Wei, Ningyu Zhang, Yue Zhang
14 Oct 2024
Training Neural Networks for Modularity aids Interpretability
Satvik Golechha, Dylan R. Cope, Nandi Schoots
24 Sep 2024
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks, Max Tegmark
10 Oct 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna, Ollie Liu, Alexandre Variengien
30 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman
05 Mar 2023
In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
24 Sep 2022