Optimal ablation for interpretability

16 September 2024
Maximilian Li, Lucas Janson
Topics: FAtt
ArXiv (abs) · PDF · HTML

Papers citing "Optimal ablation for interpretability"

44 / 44 papers shown

Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen
24 Jun 2024 · 20 citations

How to use and interpret activation patching
Stefan Heimersheim, Neel Nanda
23 Apr 2024 · 47 citations

Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah, Andrew Ilyas, Aleksander Madry
Topics: KELM
17 Apr 2024 · 17 citations

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
28 Mar 2024 · 158 citations

Explorations of Self-Repair in Language Models
Cody Rushing, Neel Nanda
Topics: KELM, MILM, LRM
23 Feb 2024 · 13 citations

Function Vectors in Large Language Models
Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau
23 Oct 2023 · 120 citations

Circuit Component Reuse Across Tasks in Transformer Language Models
Jack Merullo, Carsten Eickhoff, Ellie Pavlick
12 Oct 2023 · 71 citations

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang, Neel Nanda
Topics: LLMSV
27 Sep 2023 · 114 citations

Linearity of Relation Decoding in Transformer Language Models
Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau
Topics: KELM
17 Aug 2023 · 100 citations

LEACE: Perfect linear concept erasure in closed form
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
Topics: KELM, MU
06 Jun 2023 · 119 citations

Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt
07 Dec 2022 · 383 citations

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
24 Sep 2022 · 524 citations

Analyzing Transformers in Embedding Space
Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant
06 Sep 2022 · 92 citations

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell
Topics: AAML, AI4CE
27 Jul 2022 · 133 citations

Locating and Editing Factual Associations in GPT
Kevin Meng, David Bau, A. Andonian, Yonatan Belinkov
Topics: KELM
10 Feb 2022 · 1,381 citations

Natural Language Descriptions of Deep Visual Features
Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas
Topics: MILM
26 Jan 2022 · 124 citations

The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
Peter Hase, Harry Xie, Joey Tianyi Zhou
Topics: OODD, LRM, FAtt
01 Jun 2021 · 91 citations

Low-Complexity Probing via Finding Subnetworks
Steven Cao, Victor Sanh, Alexander M. Rush
08 Apr 2021 · 54 citations

Explaining by Removing: A Unified Framework for Model Explanation
Ian Covert, Scott M. Lundberg, Su-In Lee
Topics: FAtt
21 Nov 2020 · 251 citations

Interpretation of NLP models through input marginalization
Siwon Kim, Jihun Yi, Eunji Kim, Sungroh Yoon
Topics: MILM, FAtt
27 Oct 2020 · 60 citations

Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking
Michael Schlichtkrull, Nicola De Cao, Ivan Titov
Topics: AI4CE
01 Oct 2020 · 220 citations

Understanding the Role of Individual Units in a Deep Neural Network
David Bau, Jun-Yan Zhu, Hendrik Strobelt, Àgata Lapedriza, Bolei Zhou, Antonio Torralba
Topics: GAN
10 Sep 2020 · 452 citations

Neuron Shapley: Discovering the Responsible Neurons
Amirata Ghorbani, James Zou
Topics: FAtt, TDI
23 Feb 2020 · 113 citations

Restricting the Flow: Information Bottlenecks for Attribution
Karl Schulz, Leon Sixt, Federico Tombari, Tim Landgraf
Topics: FAtt
02 Jan 2020 · 190 citations

Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, Himabindu Lakkaraju
Topics: FAtt, AAML, MLAU
06 Nov 2019 · 821 citations

Feature relevance quantification in explainable AI: A causal problem
Dominik Janzing, Lenon Minorics, Patrick Blöbaum
Topics: FAtt, CML
29 Oct 2019 · 282 citations

CXPlain: Causal Explanations for Model Interpretation under Uncertainty
Patrick Schwab, W. Karlen
Topics: FAtt, CML
27 Oct 2019 · 209 citations

Understanding Deep Networks via Extremal Perturbations and Smooth Masks
Ruth C. Fong, Mandela Patrick, Andrea Vedaldi
Topics: AAML
18 Oct 2019 · 418 citations

The emergence of number and syntax units in LSTM language models
Yair Lakretz, Germán Kruszewski, T. Desbordes, Dieuwke Hupkes, S. Dehaene, Marco Baroni
18 Mar 2019 · 171 citations

A Benchmark for Interpretability Methods in Deep Neural Networks
Sara Hooker, D. Erhan, Pieter-Jan Kindermans, Been Kim
Topics: FAtt, UQCV
28 Jun 2018 · 682 citations

RISE: Randomized Input Sampling for Explanation of Black-box Models
Vitali Petsiuk, Abir Das, Kate Saenko
Topics: FAtt
19 Jun 2018 · 1,176 citations

Learning Efficient Convolutional Networks through Network Slimming
Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, Changshui Zhang
22 Aug 2017 · 2,424 citations

A Unified Approach to Interpreting Model Predictions
Scott M. Lundberg, Su-In Lee
Topics: FAtt
22 May 2017 · 22,002 citations

Real Time Image Saliency for Black Box Classifiers
P. Dabkowski, Y. Gal
22 May 2017 · 591 citations

Network Dissection: Quantifying Interpretability of Deep Visual Representations
David Bau, Bolei Zhou, A. Khosla, A. Oliva, Antonio Torralba
Topics: MILM, FAtt
19 Apr 2017 · 1,523 citations

Interpretable Explanations of Black Boxes by Meaningful Perturbation
Ruth C. Fong, Andrea Vedaldi
Topics: FAtt, AAML
11 Apr 2017 · 1,525 citations

Axiomatic Attribution for Deep Networks
Mukund Sundararajan, Ankur Taly, Qiqi Yan
Topics: OOD, FAtt
04 Mar 2017 · 6,015 citations

Understanding Neural Networks through Representation Erasure
Jiwei Li, Will Monroe, Dan Jurafsky
Topics: AAML, MILM
24 Dec 2016 · 567 citations

"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
Topics: FAtt, FaML
16 Feb 2016 · 17,027 citations

Object Detectors Emerge in Deep Scene CNNs
Bolei Zhou, A. Khosla, Àgata Lapedriza, A. Oliva, Antonio Torralba
Topics: ObjD
22 Dec 2014 · 1,283 citations

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Topics: FAtt
20 Dec 2013 · 7,308 citations

Visualizing and Understanding Convolutional Networks
Matthew D. Zeiler, Rob Fergus
Topics: FAtt, SSL
12 Nov 2013 · 15,902 citations

How to Explain Individual Classification Decisions
D. Baehrens, T. Schroeter, Stefan Harmeling, M. Kawanabe, K. Hansen, K. Müller
Topics: FAtt
06 Dec 2009 · 1,104 citations

Variable importance in binary regression trees and forests
H. Ishwaran
15 Nov 2007 · 386 citations