How to use and interpret activation patching

23 April 2024

Papers citing "How to use and interpret activation patching"

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates Hang Chen Jiaying Zhu Xinyu Yang Wenya Wang LRM 9 0 0 15 May 2025
Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation Chiara Manna Afra Alishahi Frédéric Blain Eva Vanmassenhove 24 0 0 13 May 2025
Towards Quantifying Commonsense Reasoning with Mechanistic Insights Abhinav Joshi A. Ahmad Divyaksh Shukla Ashutosh Modi ReLM LRM 36 0 0 14 Apr 2025
How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective Qi Liu Jiaxin Mao Ji-Rong Wen LRM 29 0 0 10 Apr 2025
Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI Nooshin Bahador 50 1 0 24 Mar 2025
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack Murong Yue Ziyu Yao SILM AAML 56 0 0 18 Mar 2025
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research Philip Quirke Clement Neo Abir Harrasse Dhruv Nathawani Amir Abdullah 44 0 0 17 Mar 2025
(How) Do Language Models Track State? Belinda Z. Li Zifan Carl Guo Jacob Andreas LRM 46 0 0 04 Mar 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges Lukasz Bartoszcze Sarthak Munshi Bryan Sukidi Jennifer Yen Zejia Yang David Williams-King Linh Le Kosi Asuzu Carsten Maple 102 0 0 24 Feb 2025
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare Hiba Ahsan Arnab Sen Sharma Silvio Amir David Bau Byron C. Wallace 88 0 0 20 Feb 2025
Exploring Translation Mechanism of Large Language Models Hongbin Zhang Kehai Chen Xuefeng Bai Xiucheng Li Yang Xiang Min Zhang 64 1 0 17 Feb 2025
Designing Role Vectors to Improve LLM Inference Behaviour Daniele Potertì Andrea Seveso Fabio Mercorio LLMSV 49 0 0 17 Feb 2025
Transformer Dynamics: A neuroscientific approach to interpretability of large language models Jesseba Fernando Grigori Guitchounts AI4CE 36 0 0 17 Feb 2025
Mechanistic Interpretability of Emotion Inference in Large Language Models Ala Nekouvaght Tak Amin Banayeeanzade Anahita Bolourani Mina Kian Robin Jia Jonathan Gratch 54 0 0 08 Feb 2025
It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis Sengim Karayalçin Marina Krček Stjepan Picek AAML 75 0 0 01 Feb 2025
Representation in large language models Cameron C. Yetman 41 1 0 03 Jan 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Javier Ferrando Oscar Obeso Senthooran Rajamanoharan Neel Nanda 82 10 0 21 Nov 2024
How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis Guan Zhe Hong Nishanth Dikkala Enming Luo Cyrus Rashtchian Xin Wang Rina Panigrahy OffRL LRM NAI 36 0 0 06 Nov 2024
Unlearning-based Neural Interpretations Ching Lam Choi Alexandre Duplessis Serge Belongie FAtt 44 0 0 10 Oct 2024
How Language Models Prioritize Contextual Grammatical Cues? Hamidreza Amirzadeh A. Alishahi Hosein Mohebbi 21 0 0 04 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models Michael A. Lepori Michael Mozer Asma Ghandeharioun LRM 85 1 0 02 Oct 2024
Optimal ablation for interpretability Maximilian Li Lucas Janson FAtt 49 2 0 16 Sep 2024
Attention Heads of Large Language Models: A Survey Zifan Zheng Yezhaohui Wang Yuxin Huang Shichao Song Mingchuan Yang Bo Tang Feiyu Xiong Zhiyu Li LRM 58 21 0 05 Sep 2024
A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models Geonhee Kim Marco Valentino André Freitas LRM AI4CE 28 7 0 16 Aug 2024
The Mechanics of Conceptual Interpretation in GPT Models: Interpretative Insights Nura Aljaafari Danilo S. Carvalho André Freitas KELM 32 0 0 05 Aug 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach Nils Palumbo Ravi Mangal Zifan Wang Saranya Vijayakumar Corina S. Pasareanu Somesh Jha 41 1 0 18 Jul 2024
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent Karolis Jucys George Adamopoulos Mehrab Hamidi Stephanie Milani Mohammad Reza Samsami Artem Zholus Sonia Joseph Blake A. Richards Irina Rish Özgür Simsek 42 2 0 16 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks Aaron Mueller CML 30 10 0 05 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 82 19 0 02 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders Connor Kissane Robert Krzyzanowski Joseph Isaac Bloom Arthur Conmy Neel Nanda MILM 35 17 0 25 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky Philippe Chlenski Neel Nanda 27 23 0 17 Jun 2024
Controlling Large Language Model Agents with Entropic Activation Steering Nate Rahn P. D'Oro Marc G. Bellemare LLMSV 30 6 0 01 Jun 2024
Exploring and steering the moral compass of Large Language Models Alejandro Tlaie LLMSV 32 3 0 27 May 2024
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 189 120 0 30 Apr 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 191 261 0 28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 212 496 0 01 Nov 2022