Optimal ablation for interpretability
Maximilian Li, Lucas Janson · 16 September 2024 · arXiv:2409.09951 · FAtt
Links: arXiv (abs) · PDF · HTML
Papers citing "Optimal ablation for interpretability" (44 of 44 papers shown)
Each entry lists title, authors, topic tags (where given), publication date, and citation count.
Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen · 24 Jun 2024 · 20 citations

How to use and interpret activation patching
Stefan Heimersheim, Neel Nanda · 23 Apr 2024 · 47 citations

Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah, Andrew Ilyas, Aleksander Madry · KELM · 17 Apr 2024 · 17 citations

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller · 28 Mar 2024 · 158 citations

Explorations of Self-Repair in Language Models
Cody Rushing, Neel Nanda · KELM, MILM, LRM · 23 Feb 2024 · 13 citations

Function Vectors in Large Language Models
Eric Todd, Millicent Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau · 23 Oct 2023 · 120 citations

Circuit Component Reuse Across Tasks in Transformer Language Models
Jack Merullo, Carsten Eickhoff, Ellie Pavlick · 12 Oct 2023 · 71 citations

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang, Neel Nanda · LLMSV · 27 Sep 2023 · 114 citations

Linearity of Relation Decoding in Transformer Language Models
Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau · KELM · 17 Aug 2023 · 100 citations

LEACE: Perfect linear concept erasure in closed form
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman · KELM, MU · 06 Jun 2023 · 119 citations

Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt · 07 Dec 2022 · 383 citations
In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah · 24 Sep 2022 · 524 citations
Analyzing Transformers in Embedding Space
Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant · 06 Sep 2022 · 92 citations
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell · AAML, AI4CE · 27 Jul 2022 · 133 citations
Locating and Editing Factual Associations in GPT
Kevin Meng, David Bau, A. Andonian, Yonatan Belinkov · KELM · 10 Feb 2022 · 1,381 citations

Natural Language Descriptions of Deep Visual Features
Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas · MILM · 26 Jan 2022 · 124 citations
The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
Peter Hase, Harry Xie, Mohit Bansal · OODD, LRM, FAtt · 01 Jun 2021 · 91 citations
Low-Complexity Probing via Finding Subnetworks
Steven Cao, Victor Sanh, Alexander M. Rush · 08 Apr 2021 · 54 citations

Explaining by Removing: A Unified Framework for Model Explanation
Ian Covert, Scott M. Lundberg, Su-In Lee · FAtt · 21 Nov 2020 · 251 citations

Interpretation of NLP models through input marginalization
Siwon Kim, Jihun Yi, Eunji Kim, Sungroh Yoon · MILM, FAtt · 27 Oct 2020 · 60 citations

Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking
Michael Schlichtkrull, Nicola De Cao, Ivan Titov · AI4CE · 01 Oct 2020 · 220 citations

Understanding the Role of Individual Units in a Deep Neural Network
David Bau, Jun-Yan Zhu, Hendrik Strobelt, Àgata Lapedriza, Bolei Zhou, Antonio Torralba · GAN · 10 Sep 2020 · 452 citations

Neuron Shapley: Discovering the Responsible Neurons
Amirata Ghorbani, James Zou · FAtt, TDI · 23 Feb 2020 · 113 citations

Restricting the Flow: Information Bottlenecks for Attribution
Karl Schulz, Leon Sixt, Federico Tombari, Tim Landgraf · FAtt · 02 Jan 2020 · 190 citations

Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, Himabindu Lakkaraju · FAtt, AAML, MLAU · 06 Nov 2019 · 821 citations

Feature relevance quantification in explainable AI: A causal problem
Dominik Janzing, Lenon Minorics, Patrick Blöbaum · FAtt, CML · 29 Oct 2019 · 282 citations

CXPlain: Causal Explanations for Model Interpretation under Uncertainty
Patrick Schwab, W. Karlen · FAtt, CML · 27 Oct 2019 · 209 citations

Understanding Deep Networks via Extremal Perturbations and Smooth Masks
Ruth C. Fong, Mandela Patrick, Andrea Vedaldi · AAML · 18 Oct 2019 · 418 citations

The emergence of number and syntax units in LSTM language models
Yair Lakretz, Germán Kruszewski, T. Desbordes, Dieuwke Hupkes, S. Dehaene, Marco Baroni · 18 Mar 2019 · 171 citations

A Benchmark for Interpretability Methods in Deep Neural Networks
Sara Hooker, D. Erhan, Pieter-Jan Kindermans, Been Kim · FAtt, UQCV · 28 Jun 2018 · 682 citations

RISE: Randomized Input Sampling for Explanation of Black-box Models
Vitali Petsiuk, Abir Das, Kate Saenko · FAtt · 19 Jun 2018 · 1,176 citations

Learning Efficient Convolutional Networks through Network Slimming
Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, Changshui Zhang · 22 Aug 2017 · 2,424 citations

A Unified Approach to Interpreting Model Predictions
Scott M. Lundberg, Su-In Lee · FAtt · 22 May 2017 · 22,002 citations

Real Time Image Saliency for Black Box Classifiers
P. Dabkowski, Y. Gal · 22 May 2017 · 591 citations

Network Dissection: Quantifying Interpretability of Deep Visual Representations
David Bau, Bolei Zhou, A. Khosla, A. Oliva, Antonio Torralba · MILM, FAtt · 19 Apr 2017 · 1,523 citations

Interpretable Explanations of Black Boxes by Meaningful Perturbation
Ruth C. Fong, Andrea Vedaldi · FAtt, AAML · 11 Apr 2017 · 1,525 citations

Axiomatic Attribution for Deep Networks
Mukund Sundararajan, Ankur Taly, Qiqi Yan · OOD, FAtt · 04 Mar 2017 · 6,015 citations

Understanding Neural Networks through Representation Erasure
Jiwei Li, Will Monroe, Dan Jurafsky · AAML, MILM · 24 Dec 2016 · 567 citations

"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin · FAtt, FaML · 16 Feb 2016 · 17,027 citations

Object Detectors Emerge in Deep Scene CNNs
Bolei Zhou, A. Khosla, Àgata Lapedriza, A. Oliva, Antonio Torralba · ObjD · 22 Dec 2014 · 1,283 citations

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman · FAtt · 20 Dec 2013 · 7,308 citations

Visualizing and Understanding Convolutional Networks
Matthew D. Zeiler, Rob Fergus · FAtt, SSL · 12 Nov 2013 · 15,902 citations

How to Explain Individual Classification Decisions
D. Baehrens, T. Schroeter, Stefan Harmeling, M. Kawanabe, K. Hansen, K. Müller · FAtt · 06 Dec 2009 · 1,104 citations

Variable importance in binary regression trees and forests
H. Ishwaran · 15 Nov 2007 · 386 citations