2401.12631
Cited By
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
23 January 2024
Zhengxuan Wu, Atticus Geiger, Jing-ling Huang, Aryaman Arora, Thomas F. Icard, Christopher Potts, Noah D. Goodman
Papers citing "A Reply to Makelov et al. (2023)'s 'Interpretability Illusion' Arguments"
2 / 2 papers shown
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman
CML
73
98
0
05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
212
494
0
01 Nov 2022