2403.00824
Information Flow Routes: Automatically Interpreting Language Models at Scale
27 February 2024
Javier Ferrando
Elena Voita
Papers citing
"Information Flow Routes: Automatically Interpreting Language Models at Scale"
12 / 12 papers shown
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
43
1
0
17 Apr 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel D'Oosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
71
0
0
13 Mar 2025
ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability
ZhongXiang Sun
Xiaoxue Zang
Kai Zheng
Yang Song
Jun Xu
Xiao Zhang
Weijie Yu
Yang Song
Han Li
57
7
0
15 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
48
12
0
08 Oct 2024
Investigating the translation capabilities of Large Language Models trained on parallel data only
Javier García Gilabert
Carlos Escolano
Aleix Sant Savall
Francesca de Luca Fornaciari
Audrey Mash
Xixian Liao
Maite Melero
LRM
42
2
0
13 Jun 2024
Knowledge Circuits in Pretrained Transformers
Yunzhi Yao
Ningyu Zhang
Zekun Xi
Meng Wang
Ziwen Xu
Shumin Deng
Huajun Chen
KELM
64
20
0
28 May 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
51
34
0
26 Mar 2024
Editing Conceptual Knowledge for Large Language Models
Xiaohan Wang
Shengyu Mao
Ningyu Zhang
Shumin Deng
Yunzhi Yao
Yue Shen
Lei Liang
Jinjie Gu
Huajun Chen
KELM
34
13
0
10 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
45
42
0
01 Mar 2024
Spectral Filters, Dark Signals, and Attention Sinks
Nicola Cancedda
58
16
0
14 Feb 2024
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna
Ollie Liu
Alexandre Variengien
LRM
189
120
0
30 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
496
0
01 Nov 2022