arXiv:2207.13243
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
27 July 2022
Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell
[AAML, AI4CE]
Papers citing "Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks" (43 of 43 papers shown):
- Studying Small Language Models with Susceptibilities (25 Apr 2025). Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet. [AAML]
- Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning (16 Feb 2025). Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun-ping Zhou, Linqi Song, Defu Lian, Yin Wei. [CLL, MU]
- Transformers Use Causal World Models in Maze-Solving Tasks (16 Dec 2024). Alex F. Spies, William Edwards, Michael I. Ivanitskiy, Adrians Skapars, Tilman Räuker, Katsumi Inoue, A. Russo, Murray Shanahan.
- Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders (20 Nov 2024). Charles O'Neill, David Klindt.
- Explainable Artificial Intelligence: A Survey of Needs, Techniques, Applications, and Future Direction (30 Aug 2024). Melkamu Mersha, Khang Lam, Joseph Wood, Ali AlShami, Jugal Kalita. [XAI, AI4TS]
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (02 Jul 2024). Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao.
- Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting (28 May 2024). Suraj Anand, Michael A. Lepori, Jack Merullo, Ellie Pavlick. [CLL]
- Understanding Multimodal Deep Neural Networks: A Concept Selection View (13 Apr 2024). Chenming Shang, Hengyuan Zhang, Hao Wen, Yujiu Yang.
- Black-Box Access is Insufficient for Rigorous AI Audits (25 Jan 2024). Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell. [AAML]
- Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models (11 Jan 2024). Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva.
- ALMANACS: A Simulatability Benchmark for Language Model Explainability (20 Dec 2023). Edmund Mills, Shiye Su, Stuart J. Russell, Scott Emmons.
- Applications of Spiking Neural Networks in Visual Place Recognition (22 Nov 2023). S. Hussaini, Michael Milford, Tobias Fischer.
- Language Models Represent Space and Time (03 Oct 2023). Wes Gurnee, Max Tegmark.
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (27 Sep 2023). Fred Zhang, Neel Nanda. [LLMSV]
- Arithmetic with Language Models: from Memorization to Computation (02 Aug 2023). Davide Maltoni, Matteo Ferrara. [KELM, LRM]
- A General Framework for Interpretable Neural Learning based on Local Information-Theoretic Goal Functions (03 Jun 2023). Abdullah Makkeh, Marcel Graetz, Andreas C. Schneider, David A. Ehrlich, V. Priesemann, Michael Wibral.
- Similarity of Neural Network Models: A Survey of Functional and Representational Measures (10 May 2023). Max Klabunde, Tobias Schumacher, M. Strohmaier, Florian Lemmerich.
- Localizing Model Behavior with Path Patching (12 Apr 2023). Nicholas W. Goldowsky-Dill, Chris MacLeod, L. Sato, Aryaman Arora.
- Rejecting Cognitivism: Computational Phenomenology for Deep Learning (16 Feb 2023). P. Beckmann, G. Köstner, Ines Hipólito.
- In-context Learning and Induction Heads (24 Sep 2022). Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah.
- Toy Models of Superposition (21 Sep 2022). Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah. [AAML, MILM]
- Probing via Prompting (04 Jul 2022). Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan.
- Attribution-based Explanations that Provide Recourse Cannot be Robust (31 May 2022). H. Fokkema, R. D. Heide, T. Erven. [FAtt]
- Post-hoc Concept Bottleneck Models (31 May 2022). Mert Yuksekgonul, Maggie Wang, James Y. Zou.
- Linear Adversarial Concept Erasure (28 Jan 2022). Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell. [KELM]
- Natural Language Descriptions of Deep Visual Features (26 Jan 2022). Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas. [MILM]
- Interpretable Image Classification with Differentiable Prototypes Assignment (06 Dec 2021). Dawid Rymarczyk, Lukasz Struski, Michal Górszczak, K. Lewandowska, Jacek Tabor, Bartosz Zieliński.
- Editing a classifier by rewriting its prediction rules (02 Dec 2021). Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, A. Madry. [KELM]
- "Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification (14 Nov 2021). Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, Katja Filippova.
- Quantifying Local Specialization in Deep Neural Networks (13 Oct 2021). Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart J. Russell.
- Robust Feature-Level Adversaries are Interpretability Tools (07 Oct 2021). Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman. [AAML]
- Probing Classifiers: Promises, Shortcomings, and Advances (24 Feb 2021). Yonatan Belinkov.
- ERIC: Extracting Relations Inferred from Convolutions (19 Oct 2020). Joe Townsend, Theodoros Kasioumis, Hiroya Inakoshi. [NAI, FAtt]
- Optimizing Mode Connectivity via Neuron Alignment (05 Sep 2020). N. Joseph Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, P. Sattigeri, Rongjie Lai. [MoMe]
- On Interpretability of Deep Learning based Skin Lesion Classifiers using Concept Activation Vectors (05 May 2020). Adriano Lucieri, Muhammad Naseer Bajwa, S. Braun, M. I. Malik, Andreas Dengel, Sheraz Ahmed. [MedIm]
- What is the State of Neural Network Pruning? (06 Mar 2020). Davis W. Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag.
- Scaling Laws for Neural Language Models (23 Jan 2020). Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei.
- Consistency-based Semi-supervised Active Learning: Towards Minimizing Labeling Cost (16 Oct 2019). M. Gao, Zizhao Zhang, Guo-Ding Yu, Sercan Ö. Arik, L. Davis, Tomas Pfister.
- e-SNLI: Natural Language Inference with Natural Language Explanations (04 Dec 2018). Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, Phil Blunsom. [LRM]
- Revisiting the Importance of Individual Units in CNNs via Ablation (07 Jun 2018). Bolei Zhou, Yiyou Sun, David Bau, Antonio Torralba. [FAtt]
- What you can cram into a single vector: Probing sentence embeddings for linguistic properties (03 May 2018). Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, Marco Baroni.
- Towards A Rigorous Science of Interpretable Machine Learning (28 Feb 2017). Finale Doshi-Velez, Been Kim. [XAI, FaML]
- ImageNet Large Scale Visual Recognition Challenge (01 Sep 2014). Olga Russakovsky, Jia Deng, Hao Su, J. Krause, S. Satheesh, ..., A. Karpathy, A. Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei. [VLM, ObjD]