The Hydra Effect: Emergent Self-repair in Language Model Computations

28 July 2023

Vladimir Mikulik

Papers citing "The Hydra Effect: Emergent Self-repair in Language Model Computations"

Title | Authors | Tags | Counts | Date
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification | Leon Eshuijs, Shihan Wang, Antske Fokkens | - | 26 0 0 | 09 May 2025
Decoding Vision Transformers: the Diffusion Steering Lens | Ryota Takatsuki, Sonia Joseph, Ippei Fujisawa, Ryota Kanai | DiffM | 30 0 0 | 18 Apr 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts | Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang | ReLM, LRM | 75 2 0 | 10 Mar 2025
Do Multilingual LLMs Think In English? | Lisa Schut, Y. Gal, Sebastian Farquhar | - | 44 3 0 | 24 Feb 2025
Activation Steering in Neural Theorem Provers | Shashank Kirtania | LLMSV | 163 0 0 | 21 Feb 2025
AND: Audio Network Dissection for Interpreting Deep Acoustic Models | Tung-Yu Wu, Yu-Xiang Lin, Tsui-Wei Weng | - | 52 1 0 | 24 Jun 2024
Finding Transformer Circuits with Edge Pruning | Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen | - | 62 17 0 | 24 Jun 2024
Interpreting the Second-Order Effects of Neurons in CLIP | Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt | MILM | 56 16 0 | 06 Jun 2024
LoFiT: Localized Fine-tuning on LLM Representations | Fangcong Yin, Xi Ye, Greg Durrett | - | 38 13 0 | 03 Jun 2024
The Garden of Forking Paths: Observing Dynamic Parameters Distribution in Large Language Models | Carlo Nicolini, Jacopo Staiano, Bruno Lepri, Raffaele Marino | MoE | 26 1 0 | 13 Mar 2024
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes | Lucio Dery, Steven Kolawole, Jean-Francois Kagey, Virginia Smith, Graham Neubig, Ameet Talwalkar | - | 39 28 0 | 08 Feb 2024
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | Fred Zhang, Neel Nanda | LLMSV | 33 97 0 | 27 Sep 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing | Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas | MILM | 160 186 0 | 02 May 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations | Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman | CML | 75 98 0 | 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | - | 212 496 0 | 01 Nov 2022
In-context Learning and Induction Heads | Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah | - | 250 460 0 | 24 Sep 2022
Toy Models of Superposition | Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah | AAML, MILM | 125 318 0 | 21 Sep 2022