Interpretability Illusions in the Generalization of Simplified Models
arXiv: 2312.03656
6 December 2023
Dan Friedman, Andrew Kyle Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun

Papers citing "Interpretability Illusions in the Generalization of Simplified Models"

13 citing papers shown
Inherently Faithful Attention Maps for Vision Transformers (10 Jun 2025)
Ananthu Aniraj, C. Dantas, Dino Ienco, Diego Marcos
Topics: OOD, OCL

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii (02 May 2025)
Kola Ayonrinde, Louis Jaburi
Topics: XAI

Linguistic Interpretability of Transformer-based Language Models: a systematic review (09 Apr 2025)
Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero
Topics: VLM

LangVAE and LangSpace: Building and Probing for Language Model VAEs (29 Mar 2025)
Danilo S. Carvalho, Yingji Zhang, Harriet Unsworth, André Freitas

Mind the Gap: Bridging the Divide Between AI Aspirations and the Reality of Autonomous Characterization (25 Feb 2025)
Grace Guinan, Addison Salvador, Michelle A. Smeaton, Andrew Glaws, Hilary Egan, Brian C. Wyatt, Babak Anasori, K. Fiedler, M. Olszta, Steven Spurgeon

Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis (17 Feb 2025)
Xiang Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou

Information Anxiety in Large Language Models (16 Nov 2024)
Prasoon Bajpai, Sarah Masud, Tanmoy Chakraborty

Mechanistic Interpretability for AI Safety -- A Review (22 Apr 2024)
Leonard Bereska, E. Gavves
Topics: AI4CE

Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms (26 Mar 2024)
Michael Hanna, Sandro Pezzelle, Yonatan Belinkov

The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models (06 Mar 2024)
Adithya Bhaskar, Dan Friedman, Danqi Chen

How do Large Language Models Handle Multilingualism? (29 Feb 2024)
Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, Lidong Bing
Topics: LRM

Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications (26 Feb 2024)
Weize Liu, Yinlong Xu, Hongxia Xu, Jintai Chen, Xuming Hu, Jian Wu

Getting aligned on representational alignment (18 Oct 2023)
Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, ..., Thomas Unterthiner, Andrew Kyle Lampinen, Klaus-Robert Muller, M. Toneva, Thomas Griffiths