Information Flow Routes: Automatically Interpreting Language Models at Scale

27 February 2024

Papers citing "Information Flow Routes: Automatically Interpreting Language Models at Scale"

12 / 12 papers shown

Title
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 43 1 0 17 Apr 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel DÓosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 71 0 0 13 Mar 2025
ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability ZhongXiang Sun Xiaoxue Zang Kai Zheng Yang Song Jun Xu Xiao Zhang Weijie Yu Yang Song Han Li 57 7 0 15 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs Guy Kaplan Matanel Oren Yuval Reif Roy Schwartz 48 12 0 08 Oct 2024
Investigating the translation capabilities of Large Language Models trained on parallel data only Javier García Gilabert Carlos Escolano Aleix Sant Savall Francesca de Luca Fornaciari Audrey Mash Xixian Liao Maite Melero LRM 42 2 0 13 Jun 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 64 20 0 28 May 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms Michael Hanna Sandro Pezzelle Yonatan Belinkov 51 34 0 26 Mar 2024
Editing Conceptual Knowledge for Large Language Models Xiaohan Wang Shengyu Mao Ningyu Zhang Shumin Deng Yunzhi Yao Yue Shen Lei Liang Jinjie Gu Huajun Chen KELM 34 13 0 10 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components János Kramár Tom Lieberum Rohin Shah Neel Nanda KELM 45 42 0 01 Mar 2024
Spectral Filters, Dark Signals, and Attention Sinks Nicola Cancedda 58 16 0 14 Feb 2024
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 189 120 0 30 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 212 496 0 01 Nov 2022