A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

23 January 2024

Papers citing "A Reply to Makelov et al. (2023)'s 'Interpretability Illusion' Arguments"

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman
05 Mar 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022