Causal Abstractions of Neural Networks

6 June 2021

Papers citing "Causal Abstractions of Neural Networks"

Tracr-Injection: Distilling Algorithms into Pre-trained Language Models Tomás Vergara-Browne Álvaro Soto 17 0 0 15 May 2025
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification Leon Eshuijs Shihan Wang Antske Fokkens 33 0 0 09 May 2025
Understanding In-context Learning of Addition via Activation Subspaces Xinyan Hu Kayo Yin Michael I. Jordan Jacob Steinhardt Lijie Chen 58 0 0 08 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 88 1 0 02 May 2025
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 53 1 0 17 Apr 2025
Steering off Course: Reliability Challenges in Steering Language Models Patrick Queiroz Da Silva Hari Sethuraman Dheeraj Rajagopal Hannaneh Hajishirzi Sachin Kumar LLMSV 39 1 0 06 Apr 2025
Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B Aleksandra Bakalova Yana Veitsman Xinting Huang Michael Hahn 38 0 0 31 Mar 2025
Are formal and functional linguistic mechanisms dissociated in language models? Michael Hanna Sandro Pezzelle Yonatan Belinkov 54 0 0 14 Mar 2025
Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy Ruixi Lin Ziqiao Wang Yang You FaML 89 1 0 07 Mar 2025
Re-Imagining Multimodal Instruction Tuning: A Representation View Yiyang Liu James Liang Ruixiang Tang Yugyung Lee Majid Rabbani ... Raghuveer M. Rao Lifu Huang Dongfang Liu Qifan Wang Cheng Han 213 0 0 02 Mar 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning Lefei Zhang Lijie Hu Di Wang LRM 100 1 0 17 Feb 2025
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution Shichang Zhang Tessa Han Usha Bhalla Hima Lakkaraju FAtt 160 0 0 17 Feb 2025
Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions H. Fokkema T. Erven Sara Magliacane 72 1 0 10 Feb 2025
It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis Sengim Karayalçin Marina Krček Stjepan Picek AAML 80 0 0 01 Feb 2025
What is causal about causal models and representations? Frederik Hytting Jørgensen Luigi Gresele S. Weichwald CML 113 0 0 31 Jan 2025
Aligning Graphical and Functional Causal Abstractions Wilem Schooltink Fabio Massimo Zennaro 73 1 0 22 Dec 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 47 5 0 07 Nov 2024
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics Yaniv Nikankin Anja Reusch Aaron Mueller Yonatan Belinkov AIFin LRM 43 25 0 28 Oct 2024
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion Denitsa Saynova Lovisa Hagström Moa Johansson Richard Johansson Marco Kuhlmann HILM 46 0 0 18 Oct 2024
On the Role of Attention Heads in Large Language Model Safety Zhenhong Zhou Haiyang Yu Xinghua Zhang Rongwu Xu Fei Huang Kun Wang Yang Liu Sihang Li Yongbin Li 59 5 0 17 Oct 2024
Inference and Verbalization Functions During In-Context Learning Junyi Tao Xiaoyin Chen Nelson F. Liu LRM ReLM 26 1 0 12 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models Michael A. Lepori Michael Mozer Asma Ghandeharioun LRM 87 1 0 02 Oct 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs Nitay Calderon Roi Reichart 47 13 0 27 Jul 2024
Representing Rule-based Chatbots with Transformers Dan Friedman Abhishek Panigrahi Danqi Chen 71 1 0 15 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 85 22 0 02 Jul 2024
Monitoring Latent World States in Language Models with Propositional Probes Jiahai Feng Stuart Russell Jacob Steinhardt HILM 48 8 0 27 Jun 2024
Learned feature representations are biased by complexity, learning order, position, and more Andrew Kyle Lampinen Stephanie C. Y. Chan Katherine Hermann AI4CE FaML SSL OOD 45 6 0 09 May 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 53 122 0 28 Mar 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 39 90 0 11 Jan 2024
Interventionally Consistent Surrogates for Agent-based Simulators Joel Dyer Nicholas Bishop Yorgos Felekis Fabio Massimo Zennaro Anisoara Calinescu Theodoros Damoulas Michael Wooldridge 19 6 0 18 Dec 2023
Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals Yanai Elazar Bhargavi Paranjape Hao Peng Sarah Wiegreffe Khyathi Raghavi Vivek Srikumar Sameer Singh Noah A. Smith AAML OOD 34 0 0 16 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 78 7 0 07 Nov 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks Alex Tamkin Mohammad Taufeeque Noah D. Goodman 40 27 0 26 Oct 2023
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers Anna Langedijk Hosein Mohebbi Gabriele Sarti Willem H. Zuidema Jaap Jumelet 32 10 0 05 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 41 101 0 27 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models Hoagy Cunningham Aidan Ewart Logan Riggs R. Huben Lee Sharkey MILM 33 347 0 15 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP Vedant Palit Rohan Pandey Aryaman Arora Paul Pu Liang 34 20 0 27 Aug 2023
Arithmetic with Language Models: from Memorization to Computation Davide Maltoni Matteo Ferrara KELM LRM 47 5 0 02 Aug 2023
A Geometric Notion of Causal Probing Clément Guerner Anej Svete Tianyu Liu Alex Warstadt Ryan Cotterell LLMSV 41 12 0 27 Jul 2023
Causal interventions expose implicit situation models for commonsense language understanding Takateru Yamakoshi James L. McClelland A. Goldberg Robert D. Hawkins 34 6 0 06 Jun 2023
Causal Analysis for Robust Interpretability of Neural Networks Ola Ahmad Nicolas Béreux Loïc Baret V. Hashemi Freddy Lecue CML 34 3 0 15 May 2023
Computational modeling of semantic change Nina Tahmasebi Haim Dubossarsky 40 6 0 13 Apr 2023
Localizing Model Behavior with Path Patching Nicholas W. Goldowsky-Dill Chris MacLeod L. Sato Aryaman Arora 42 85 0 12 Apr 2023
A Discerning Several Thousand Judgments: GPT-3 Rates the Article + Adjective + Numeral + Noun Construction Kyle Mahowald 24 24 0 29 Jan 2023
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training Jing-ling Huang Zhengxuan Wu Kyle Mahowald Christopher Potts 29 13 0 19 Dec 2022
Causal Abstraction with Soft Interventions Riccardo Massidda Atticus Geiger Thomas Icard D. Bacciu 28 13 0 22 Nov 2022
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 218 515 0 01 Nov 2022
Causal Proxy Models for Concept-Based Model Explanations Zhengxuan Wu Karel D'Oosterlinck Atticus Geiger Amir Zur Christopher Potts MILM 83 35 0 28 Sep 2022
FACT: Learning Governing Abstractions Behind Integer Sequences Peter Belcak Ard Kastrati Flavio Schenker Roger Wattenhofer 43 5 0 20 Sep 2022
Leveraging Explanations in Interactive Machine Learning: An Overview Stefano Teso Öznur Alkan Wolfgang Stammer Elizabeth M. Daly XAI FAtt LRM 28 62 0 29 Jul 2022