RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

27 February 2024


Papers citing "RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations"


Do Language Models Use Their Depth Efficiently? Róbert Csordás Christopher D. Manning Christopher Potts 200 2 0 20 May 2025
Are Sparse Autoencoders Useful for Java Function Bug Detection? Rui Melo Claudia Mamede Andre Catarino Rui Abreu Henrique Lopes Cardoso 103 0 0 15 May 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel D'Oosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 105 2 0 13 Mar 2025
Identifying Sub-networks in Neural Networks via Functionally Similar Representations Tian Gao Amit Dhurandhar Karthikeyan N. Ramamurthy Dennis L. Wei 84 0 0 21 Oct 2024
Inference and Verbalization Functions During In-Context Learning Junyi Tao Xiaoyin Chen Nelson F. Liu LRM ReLM 75 1 0 12 Oct 2024
Residual Stream Analysis with Multi-Layer SAEs Tim Lawson Lucy Farnik Conor Houghton Laurence Aitchison 76 5 0 06 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 174 33 0 02 Jul 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 124 114 0 11 Jan 2024
ALMANACS: A Simulatability Benchmark for Language Model Explainability Edmund Mills Shiye Su Stuart J. Russell Scott Emmons 145 9 0 20 Dec 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks Alex Tamkin Mohammad Taufeeque Noah D. Goodman 78 29 0 26 Oct 2023
How do Language Models Bind Entities in Context? Jiahai Feng Jacob Steinhardt 123 40 0 26 Oct 2023
FIND: A Function Description Benchmark for Evaluating Interpretability Methods Sarah Schwettmann Tamar Rott Shaham Joanna Materzyńska Neil Chowdhury Shuang Li Jacob Andreas David Bau Antonio Torralba 40 22 0 07 Sep 2023
Linearity of Relation Decoding in Transformer Language Models Evan Hernandez Arnab Sen Sharma Tal Haklay Kevin Meng Martin Wattenberg Jacob Andreas Yonatan Belinkov David Bau KELM 82 100 0 17 Aug 2023
Discovering Variable Binding Circuitry with Desiderata Xander Davies Max Nadeau Nikhil Prakash Tamar Rott Shaham David Bau 64 15 0 07 Jul 2023
LEACE: Perfect linear concept erasure in closed form Nora Belrose David Schneider-Joseph Shauli Ravfogel Ryan Cotterell Edward Raff Stella Biderman KELM MU 137 120 0 06 Jun 2023
Faithfulness Tests for Natural Language Explanations Pepa Atanasova Oana-Maria Camburu Christina Lioma Thomas Lukasiewicz J. Simonsen Isabelle Augenstein FAtt 114 67 0 29 May 2023
MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions Zexuan Zhong Zhengxuan Wu Christopher D. Manning Christopher Potts Danqi Chen KELM 84 217 0 24 May 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability Arthur Conmy Augustine N. Mavor-Parker Aengus Lynch Stefan Heimersheim Adrià Garriga-Alonso 66 319 0 28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 314 563 0 01 Nov 2022
CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior Eldar David Abraham Karel D'Oosterlinck Amir Feder Y. Gat Atticus Geiger Christopher Potts Roi Reichart Zhengxuan Wu CML 120 47 0 27 May 2022
Locating and Editing Factual Associations in GPT Kevin Meng David Bau A. Andonian Yonatan Belinkov KELM 251 1,389 0 10 Feb 2022
Sparse Interventions in Language Models with Differentiable Masking Nicola De Cao Leon Schmid Dieuwke Hupkes Ivan Titov 70 29 0 13 Dec 2021
Conditional probing: measuring usable information beyond a baseline John Hewitt Kawin Ethayarajh Percy Liang Christopher D. Manning 72 57 0 19 Sep 2021
Implicit Representations of Meaning in Neural Language Models Belinda Z. Li Maxwell Nye Jacob Andreas NAI MILM 67 177 0 01 Jun 2021
An Interpretability Illusion for BERT Tolga Bolukbasi Adam Pearce Ann Yuan Andy Coenen Emily Reif Fernanda Viégas Martin Wattenberg MILM FAtt 83 81 0 14 Apr 2021
Transformer Feed-Forward Layers Are Key-Value Memories Mor Geva R. Schuster Jonathan Berant Omer Levy KELM 182 847 0 29 Dec 2020
Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks Róbert Csordás Sjoerd van Steenkiste Jürgen Schmidhuber 98 97 0 05 Oct 2020
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection Shauli Ravfogel Yanai Elazar Hila Gonen Michael Twiton Yoav Goldberg 144 388 0 16 Apr 2020
What Does BERT Look At? An Analysis of BERT's Attention Kevin Clark Urvashi Khandelwal Omer Levy Christopher D. Manning MILM 235 1,605 0 11 Jun 2019
BERT Rediscovers the Classical NLP Pipeline Ian Tenney Dipanjan Das Ellie Pavlick MILM SSeg 145 1,482 0 15 May 2019
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations Francesco Locatello Stefan Bauer Mario Lucic Gunnar Rätsch Sylvain Gelly Bernhard Schölkopf Olivier Bachem OOD 139 1,473 0 29 Nov 2018
Dissecting Contextual Word Embeddings: Architecture and Representation Matthew E. Peters Mark Neumann Luke Zettlemoyer Wen-tau Yih 109 431 0 27 Aug 2018
What you can cram into a single vector: Probing sentence embeddings for linguistic properties Alexis Conneau Germán Kruszewski Guillaume Lample Loïc Barrault Marco Baroni 353 897 0 03 May 2018
Linear Algebraic Structure of Word Senses, with Applications to Polysemy Sanjeev Arora Yuanzhi Li Yingyu Liang Tengyu Ma Andrej Risteski 95 284 0 14 Jan 2016
Direct and Indirect Effects Judea Pearl CML 100 2,179 0 10 Jan 2013