Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach

18 July 2024

Papers citing "Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach"

Towards Combinatorial Interpretability of Neural Computation Micah Adler Dan Alistarh Nir Shavit FAtt 359 2 0 10 Apr 2025
How to use and interpret activation patching Stefan Heimersheim Neel Nanda 71 46 0 23 Apr 2024
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 110 7 0 07 Nov 2023
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks Ziqian Zhong Ziming Liu Max Tegmark Jacob Andreas 69 100 0 30 Jun 2023
Faith and Fate: Limits of Transformers on Compositionality Nouha Dziri Ximing Lu Melanie Sclar Xiang Lorraine Li Liwei Jian ... Sean Welleck Xiang Ren Allyson Ettinger Zaïd Harchaoui Yejin Choi ReLM LRM 122 376 0 29 May 2023
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca Zhengxuan Wu Atticus Geiger Thomas Icard Christopher Potts Noah D. Goodman MILM 75 92 0 15 May 2023
Discovering Latent Knowledge in Language Models Without Supervision Collin Burns Haotian Ye Dan Klein Jacob Steinhardt 128 375 0 07 Dec 2022
Transformers Learn Shortcuts to Automata Bingbin Liu Jordan T. Ash Surbhi Goel A. Krishnamurthy Cyril Zhang OffRL LRM 126 175 0 19 Oct 2022
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks Tilman Raukur A. Ho Stephen Casper Dylan Hadfield-Menell AAML AI4CE 93 132 0 27 Jul 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 269 445 0 24 Feb 2021
Influence Patterns for Explaining Information Flow in BERT Kaiji Lu Zifan Wang Piotr (Peter) Mardziel Anupam Datta GNN 65 16 0 02 Nov 2020
The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? Jasmijn Bastings Katja Filippova XAI LRM 95 177 0 12 Oct 2020
Quantifying Attention Flow in Transformers Samira Abnar Willem H. Zuidema 157 796 0 02 May 2020
On Completeness-aware Concept-Based Explanations in Deep Neural Networks Chih-Kuan Yeh Been Kim Sercan O. Arik Chun-Liang Li Tomas Pfister Pradeep Ravikumar FAtt 228 305 0 17 Oct 2019
Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks Mehdi Neshat Zifan Wang Bradley Alexander Fan Yang Zijian Zhang Sirui Ding Markus Wagner Xia Hu FAtt 93 1,069 0 03 Oct 2019
Learning to Deceive with Attention-Based Explanations Danish Pruthi Mansi Gupta Bhuwan Dhingra Graham Neubig Zachary Chase Lipton 74 193 0 17 Sep 2019
Designing and Interpreting Probes with Control Tasks John Hewitt Percy Liang 70 536 0 08 Sep 2019
Attention is not not Explanation Sarah Wiegreffe Yuval Pinter XAI AAML FAtt 120 909 0 13 Aug 2019
BERT Rediscovers the Classical NLP Pipeline Ian Tenney Dipanjan Das Ellie Pavlick MILM SSeg 135 1,471 0 15 May 2019
Attention is not Explanation Sarthak Jain Byron C. Wallace FAtt 145 1,324 0 26 Feb 2019
RISE: Randomized Input Sampling for Explanation of Black-box Models Vitali Petsiuk Abir Das Kate Saenko FAtt 181 1,170 0 19 Jun 2018
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) Been Kim Martin Wattenberg Justin Gilmer Carrie J. Cai James Wexler F. Viégas Rory Sayres FAtt 214 1,842 0 30 Nov 2017
Interpretable Explanations of Black Boxes by Meaningful Perturbation Ruth C. Fong Andrea Vedaldi FAtt AAML 74 1,520 0 11 Apr 2017
Axiomatic Attribution for Deep Networks Mukund Sundararajan Ankur Taly Qiqi Yan OOD FAtt 188 5,989 0 04 Mar 2017
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Ramprasaath R. Selvaraju Michael Cogswell Abhishek Das Ramakrishna Vedantam Devi Parikh Dhruv Batra FAtt 297 20,023 0 07 Oct 2016
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps Karen Simonyan Andrea Vedaldi Andrew Zisserman FAtt 312 7,295 0 20 Dec 2013