Interpretability Illusions in the Generalization of Simplified Models

6 December 2023

Andrew Kyle Lampinen

Asma Ghandeharioun

Papers citing "Interpretability Illusions in the Generalization of Simplified Models"

Inherently Faithful Attention Maps for Vision Transformers Ananthu Aniraj C. Dantas Dino Ienco Diego Marcos OOD OCL 39 0 0 10 Jun 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 147 1 0 02 May 2025
Linguistic Interpretability of Transformer-based Language Models: a systematic review Miguel López-Otal Jorge Gracia Jordi Bernad Carlos Bobed Lucía Pitarch-Ballesteros Emma Anglés-Herrero VLM 108 1 0 09 Apr 2025
LangVAE and LangSpace: Building and Probing for Language Model VAEs Danilo S. Carvalho Yingji Zhang Harriet Unsworth André Freitas 91 0 0 29 Mar 2025
Mind the Gap: Bridging the Divide Between AI Aspirations and the Reality of Autonomous Characterization Grace Guinan Addison Salvador Michelle A. Smeaton Andrew Glaws Hilary Egan Brian C. Wyatt Babak Anasori K. Fiedler M. Olszta Steven Spurgeon 124 0 0 25 Feb 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis Xiang Wang Yan Hu Wenyu Du Reynold Cheng Benyou Wang Difan Zou 157 3 0 17 Feb 2025
Information Anxiety in Large Language Models Prasoon Bajpai Sarah Masud Tanmoy Chakraborty 69 0 0 16 Nov 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 137 158 0 22 Apr 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms Michael Hanna Sandro Pezzelle Yonatan Belinkov 93 43 0 26 Mar 2024
The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models Adithya Bhaskar Dan Friedman Danqi Chen 114 7 0 06 Mar 2024
How do Large Language Models Handle Multilingualism? Yiran Zhao Wenxuan Zhang Guizhen Chen Kenji Kawaguchi Lidong Bing LRM 108 81 0 29 Feb 2024
Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications Weize Liu Yinlong Xu Hongxia Xu Jintai Chen Xuming Hu Jian Wu 67 0 0 26 Feb 2024
Getting aligned on representational alignment Ilia Sucholutsky Lukas Muttenthaler Adrian Weller Andi Peng Andreea Bobu ... Thomas Unterthiner Andrew Kyle Lampinen Klaus-Robert Muller M. Toneva Thomas Griffiths 158 93 0 18 Oct 2023