arXiv:2207.13243
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
27 July 2022
Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell
[AAML, AI4CE]
Papers citing "Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks" (43 of 43 papers shown):
- Studying Small Language Models with Susceptibilities (25 Apr 2025). Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet. [AAML]
- Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning (16 Feb 2025). Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun-ping Zhou, Linqi Song, Defu Lian, Yin Wei. [CLL, MU]
- Transformers Use Causal World Models in Maze-Solving Tasks (16 Dec 2024). Alex F. Spies, William Edwards, Michael I. Ivanitskiy, Adrians Skapars, Tilman Räuker, Katsumi Inoue, A. Russo, Murray Shanahan.
- Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders (20 Nov 2024). Charles O'Neill, David Klindt.
- Explainable Artificial Intelligence: A Survey of Needs, Techniques, Applications, and Future Direction (30 Aug 2024). Melkamu Mersha, Khang Lam, Joseph Wood, Ali AlShami, Jugal Kalita. [XAI, AI4TS]
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (02 Jul 2024). Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao.
- Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting (28 May 2024). Suraj Anand, Michael A. Lepori, Jack Merullo, Ellie Pavlick. [CLL]
- Understanding Multimodal Deep Neural Networks: A Concept Selection View (13 Apr 2024). Chenming Shang, Hengyuan Zhang, Hao Wen, Yujiu Yang.
- Black-Box Access is Insufficient for Rigorous AI Audits (25 Jan 2024). Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell. [AAML]
- Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models (11 Jan 2024). Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva.
- ALMANACS: A Simulatability Benchmark for Language Model Explainability (20 Dec 2023). Edmund Mills, Shiye Su, Stuart J. Russell, Scott Emmons.
- Applications of Spiking Neural Networks in Visual Place Recognition (22 Nov 2023). S. Hussaini, Michael Milford, Tobias Fischer.
- Language Models Represent Space and Time (03 Oct 2023). Wes Gurnee, Max Tegmark.
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (27 Sep 2023). Fred Zhang, Neel Nanda. [LLMSV]
- Arithmetic with Language Models: from Memorization to Computation (02 Aug 2023). Davide Maltoni, Matteo Ferrara. [KELM, LRM]
- A General Framework for Interpretable Neural Learning based on Local Information-Theoretic Goal Functions (03 Jun 2023). Abdullah Makkeh, Marcel Graetz, Andreas C. Schneider, David A. Ehrlich, V. Priesemann, Michael Wibral.
- Similarity of Neural Network Models: A Survey of Functional and Representational Measures (10 May 2023). Max Klabunde, Tobias Schumacher, M. Strohmaier, Florian Lemmerich.
- Localizing Model Behavior with Path Patching (12 Apr 2023). Nicholas W. Goldowsky-Dill, Chris MacLeod, L. Sato, Aryaman Arora.
- Rejecting Cognitivism: Computational Phenomenology for Deep Learning (16 Feb 2023). P. Beckmann, G. Köstner, Ines Hipólito.
- In-context Learning and Induction Heads (24 Sep 2022). Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah.
- Toy Models of Superposition (21 Sep 2022). Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah. [AAML, MILM]
- Probing via Prompting (04 Jul 2022). Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan.
- Attribution-based Explanations that Provide Recourse Cannot be Robust (31 May 2022). H. Fokkema, R. D. Heide, T. Erven. [FAtt]
- Post-hoc Concept Bottleneck Models (31 May 2022). Mert Yuksekgonul, Maggie Wang, James Y. Zou.
- Linear Adversarial Concept Erasure (28 Jan 2022). Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell. [KELM]
- Natural Language Descriptions of Deep Visual Features (26 Jan 2022). Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas. [MILM]
- Interpretable Image Classification with Differentiable Prototypes Assignment (06 Dec 2021). Dawid Rymarczyk, Lukasz Struski, Michal Górszczak, K. Lewandowska, Jacek Tabor, Bartosz Zieliński.
- Editing a classifier by rewriting its prediction rules (02 Dec 2021). Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, A. Madry. [KELM]
- "Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification (14 Nov 2021). Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, Katja Filippova.
- Quantifying Local Specialization in Deep Neural Networks (13 Oct 2021). Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart J. Russell.
- Robust Feature-Level Adversaries are Interpretability Tools (07 Oct 2021). Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman. [AAML]
- Probing Classifiers: Promises, Shortcomings, and Advances (24 Feb 2021). Yonatan Belinkov.
- ERIC: Extracting Relations Inferred from Convolutions (19 Oct 2020). Joe Townsend, Theodoros Kasioumis, Hiroya Inakoshi. [NAI, FAtt]
- Optimizing Mode Connectivity via Neuron Alignment (05 Sep 2020). N. Joseph Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, P. Sattigeri, Rongjie Lai. [MoMe]
- On Interpretability of Deep Learning based Skin Lesion Classifiers using Concept Activation Vectors (05 May 2020). Adriano Lucieri, Muhammad Naseer Bajwa, S. Braun, M. I. Malik, Andreas Dengel, Sheraz Ahmed. [MedIm]
- What is the State of Neural Network Pruning? (06 Mar 2020). Davis W. Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag.
- Scaling Laws for Neural Language Models (23 Jan 2020). Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei.
- Consistency-based Semi-supervised Active Learning: Towards Minimizing Labeling Cost (16 Oct 2019). M. Gao, Zizhao Zhang, Guo-Ding Yu, Sercan Ö. Arik, L. Davis, Tomas Pfister.
- e-SNLI: Natural Language Inference with Natural Language Explanations (04 Dec 2018). Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, Phil Blunsom. [LRM]
- Revisiting the Importance of Individual Units in CNNs via Ablation (07 Jun 2018). Bolei Zhou, Yiyou Sun, David Bau, Antonio Torralba. [FAtt]
- What you can cram into a single vector: Probing sentence embeddings for linguistic properties (03 May 2018). Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, Marco Baroni.
- Towards A Rigorous Science of Interpretable Machine Learning (28 Feb 2017). Finale Doshi-Velez, Been Kim. [XAI, FaML]
- ImageNet Large Scale Visual Recognition Challenge (01 Sep 2014). Olga Russakovsky, Jia Deng, Hao Su, J. Krause, S. Satheesh, ..., A. Karpathy, A. Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei. [VLM, ObjD]