Causal Abstractions of Neural Networks

6 June 2021

Papers citing "Causal Abstractions of Neural Networks"

42 / 42 papers shown

Title
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification Leon Eshuijs Shihan Wang Antske Fokkens 26 0 0 09 May 2025
Understanding In-context Learning of Addition via Activation Subspaces Xinyan Hu Kayo Yin Michael I. Jordan Jacob Steinhardt Lijie Chen 51 0 0 08 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 78 1 0 02 May 2025
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 43 1 0 17 Apr 2025
Are formal and functional linguistic mechanisms dissociated in language models? Michael Hanna Sandro Pezzelle Yonatan Belinkov 45 0 0 14 Mar 2025
Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy Ruixi Lin Ziqiao Wang Yang You FaML 84 0 0 07 Mar 2025
Re-Imagining Multimodal Instruction Tuning: A Representation View Yiyang Liu James Liang Ruixiang Tang Yugyung Lee Majid Rabbani ... Raghuveer M. Rao Lifu Huang Dongfang Liu Qifan Wang Cheng Han 123 0 0 02 Mar 2025
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution Shichang Zhang Tessa Han Usha Bhalla Hima Lakkaraju FAtt 147 0 0 17 Feb 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning L. Zhang Lijie Hu Di Wang LRM 90 0 0 17 Feb 2025
What is causal about causal models and representations? Frederik Hytting Jørgensen Luigi Gresele S. Weichwald CML 105 0 0 31 Jan 2025
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 40 5 0 07 Nov 2024
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion Denitsa Saynova Lovisa Hagström Moa Johansson Richard Johansson Marco Kuhlmann HILM 39 0 0 18 Oct 2024
On the Role of Attention Heads in Large Language Model Safety Z. Zhou Haiyang Yu Xinghua Zhang Rongwu Xu Fei Huang Kun Wang Yang Liu Junfeng Fang Yongbin Li 57 5 0 17 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models Michael A. Lepori Michael Mozer Asma Ghandeharioun LRM 80 1 0 02 Oct 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs Nitay Calderon Roi Reichart 36 10 0 27 Jul 2024
Representing Rule-based Chatbots with Transformers Dan Friedman Abhishek Panigrahi Danqi Chen 61 1 0 15 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 79 19 0 02 Jul 2024
Monitoring Latent World States in Language Models with Propositional Probes Jiahai Feng Stuart Russell Jacob Steinhardt HILM 34 6 0 27 Jun 2024
Learned feature representations are biased by complexity, learning order, position, and more Andrew Kyle Lampinen Stephanie C. Y. Chan Katherine Hermann AI4CE FaML SSL OOD 32 6 0 09 May 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 44 111 0 28 Mar 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 27 87 0 11 Jan 2024
Interventionally Consistent Surrogates for Agent-based Simulators Joel Dyer Nicholas Bishop Yorgos Felekis Fabio Massimo Zennaro Anisoara Calinescu Theodoros Damoulas Michael Wooldridge 19 6 0 18 Dec 2023
Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals Yanai Elazar Bhargavi Paranjape Hao Peng Sarah Wiegreffe Khyathi Raghavi Vivek Srikumar Sameer Singh Noah A. Smith AAML OOD 26 0 0 16 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 72 7 0 07 Nov 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks Alex Tamkin Mohammad Taufeeque Noah D. Goodman 30 27 0 26 Oct 2023
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers Anna Langedijk Hosein Mohebbi Gabriele Sarti Willem H. Zuidema Jaap Jumelet 19 10 0 05 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 28 97 0 27 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models Hoagy Cunningham Aidan Ewart Logan Riggs R. Huben Lee Sharkey MILM 28 332 0 15 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP Vedant Palit Rohan Pandey Aryaman Arora Paul Pu Liang 26 20 0 27 Aug 2023
Arithmetic with Language Models: from Memorization to Computation Davide Maltoni Matteo Ferrara KELM LRM 27 5 0 02 Aug 2023
A Geometric Notion of Causal Probing Clément Guerner Anej Svete Tianyu Liu Alex Warstadt Ryan Cotterell LLMSV 38 12 0 27 Jul 2023
Causal interventions expose implicit situation models for commonsense language understanding Takateru Yamakoshi James L. McClelland A. Goldberg Robert D. Hawkins 17 6 0 06 Jun 2023
Computational modeling of semantic change Nina Tahmasebi Haim Dubossarsky 26 6 0 13 Apr 2023
Localizing Model Behavior with Path Patching Nicholas W. Goldowsky-Dill Chris MacLeod L. Sato Aryaman Arora 18 85 0 12 Apr 2023
A Discerning Several Thousand Judgments: GPT-3 Rates the Article + Adjective + Numeral + Noun Construction Kyle Mahowald 22 24 0 29 Jan 2023
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training Jing-ling Huang Zhengxuan Wu Kyle Mahowald Christopher Potts 21 13 0 19 Dec 2022
Causal Abstraction with Soft Interventions Riccardo Massidda Atticus Geiger Thomas F. Icard D. Bacciu 23 13 0 22 Nov 2022
Causal Proxy Models for Concept-Based Model Explanations Zhengxuan Wu Karel DÓosterlinck Atticus Geiger Amir Zur Christopher Potts MILM 75 35 0 28 Sep 2022
FACT: Learning Governing Abstractions Behind Integer Sequences Peter Belcak Ard Kastrati Flavio Schenker Roger Wattenhofer 31 5 0 20 Sep 2022
Linguistic Frameworks Go Toe-to-Toe at Neuro-Symbolic Language Modeling Jakob Prange Nathan Schneider Lingpeng Kong 19 9 0 15 Dec 2021
Inducing Causal Structure for Interpretable Neural Networks Atticus Geiger Zhengxuan Wu Hanson Lu J. Rozner Elisa Kreiss Thomas F. Icard Noah D. Goodman Christopher Potts CML OOD 16 70 0 01 Dec 2021
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond Amir Feder Katherine A. Keith Emaad A. Manzoor Reid Pryzant Dhanya Sridhar ... Roi Reichart Margaret E. Roberts Brandon M Stewart Victor Veitch Diyi Yang CML 30 234 0 02 Sep 2021