Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

27 September 2023

Papers citing "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"

37 / 87 papers shown

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models Somnath Banerjee Soham Tripathy Sayan Layek Shanu Kumar Animesh Mukherjee Rima Hazra 25 1 0 18 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky Philippe Chlenski Neel Nanda 27 23 0 17 Jun 2024
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network Erik Jenner Shreyas Kapur Vasil Georgiev Cameron Allen Scott Emmons Stuart J. Russell 32 10 0 02 Jun 2024
Exploring and steering the moral compass of Large Language Models Alejandro Tlaie LLMSV 32 3 0 27 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks Chak Tou Leong Yi Cheng Kaishuai Xu Jian Wang Hanlin Wang Wenjie Li AAML 51 17 0 25 May 2024
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization Boshi Wang Xiang Yue Yu-Chuan Su Huan Sun LRM 29 41 0 23 May 2024
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models Charles O'Neill Thang Bui 40 5 0 21 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward Raphael Milliere Cameron Buckner LRM 66 13 0 06 May 2024
How to use and interpret activation patching Stefan Heimersheim Neel Nanda 35 37 0 23 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 40 114 0 22 Apr 2024
Finding Visual Task Vectors Alberto Hojel Yutong Bai Trevor Darrell Amir Globerson Amir Bar 64 6 0 08 Apr 2024
Locating and Editing Factual Associations in Mamba Arnab Sen Sharma David Atkinson David Bau KELM 73 28 0 04 Apr 2024
Unveiling LLMs: The Evolution of Latent Representations in a Temporal Knowledge Graph Marco Bronzini Carlo Nicolini Bruno Lepri Jacopo Staiano Andrea Passerini KELM 28 5 0 04 Apr 2024
On Large Language Models' Hallucination with Regard to Known Facts Che Jiang Biqing Qi Xiangyu Hong Dayuan Fu Yang Cheng Fandong Meng Mo Yu Bowen Zhou Jie Zhou HILM LRM 31 16 0 29 Mar 2024
Localizing Paragraph Memorization in Language Models Niklas Stoehr Mitchell Gordon Chiyuan Zhang Owen Lewis MU 38 13 0 28 Mar 2024
Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models Ang Lv Yuhan Chen Kaiyi Zhang Yulong Wang Lifeng Liu Ji-Rong Wen Jian Xie Rui Yan KELM 32 16 0 28 Mar 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms Michael Hanna Sandro Pezzelle Yonatan Belinkov 51 34 0 26 Mar 2024
Monotonic Representation of Numeric Properties in Language Models Benjamin Heinzerling Kentaro Inui KELM MILM 45 9 0 15 Mar 2024
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines Michael Toker Hadas Orgad Mor Ventura Dana Arad Yonatan Belinkov DiffM 64 12 0 09 Mar 2024
The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models Adithya Bhaskar Dan Friedman Danqi Chen 35 5 0 06 Mar 2024
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning Subhabrata Dutta Joykirat Singh Soumen Chakrabarti Tanmoy Chakraborty LRM 43 23 0 28 Feb 2024
Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models Zhuoran Jin Pengfei Cao Hongbang Yuan Yubo Chen Jiexin Xu Huaijun Li Xiaojian Jiang Kang Liu Jun Zhao 183 36 0 28 Feb 2024
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT Zhengfu He Xuyang Ge Qiong Tang Tianxiang Sun Qinyuan Cheng Xipeng Qiu 39 20 0 19 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models Goutham Rajendran Simon Buchholz Bryon Aragam Bernhard Schölkopf Pradeep Ravikumar AI4CE 91 21 0 14 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 34 87 0 11 Jan 2024
Neuron-Level Knowledge Attribution in Large Language Models Zeping Yu Sophia Ananiadou FAtt KELM 24 7 0 19 Dec 2023
Forbidden Facts: An Investigation of Competing Objectives in Llama-2 Tony T. Wang Miles Wang Kaivu Hariharan Nir Shavit 21 2 0 14 Dec 2023
An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l James Dao Yeu-Tong Lau Can Rager Jett Janiak 35 5 0 11 Oct 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 160 186 0 02 May 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 189 120 0 30 Apr 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 191 261 0 28 Apr 2023
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding Yuchen Li Yuan-Fang Li Andrej Risteski 120 61 0 07 Mar 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Atticus Geiger Zhengxuan Wu Christopher Potts Thomas F. Icard Noah D. Goodman CML 75 98 0 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 212 496 0 01 Nov 2022
Polysemanticity and Capacity in Neural Networks Adam Scherlis Kshitij Sachan Adam Jermyn Joe Benton Buck Shlegeris MILM 135 25 0 04 Oct 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova DasSarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 250 460 0 24 Sep 2022
Natural Language Descriptions of Deep Visual Features Evan Hernandez Sarah Schwettmann David Bau Teona Bagashvili Antonio Torralba Jacob Andreas MILM 201 117 0 26 Jan 2022