2407.02646
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
2 July 2024
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
Papers citing "A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models"
31 / 31 papers shown
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen
Jiaying Zhu
Xinyu Yang
Wenya Wang
LRM
9
0
0
15 May 2025
In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
Pranav Kantroo
Günter P. Wagner
Benjamin B. Machta
45
0
0
23 Apr 2025
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
Aviv Bick
Eric P. Xing
Albert Gu
RALM
88
0
0
22 Apr 2025
Layers at Similar Depths Generate Similar Activations Across LLM Architectures
Christopher Wolfram
Aaron Schein
29
0
0
03 Apr 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
58
0
0
23 Feb 2025
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Hantao Lou
Changye Li
Jiaming Ji
Yaodong Yang
40
1
0
22 Feb 2025
An explainable transformer circuit for compositional generalization
Cheng Tang
Brenden Lake
Mehrdad Jazayeri
LRM
39
0
0
19 Feb 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
X. Wang
Yan Hu
Wenyu Du
Reynold Cheng
Benyou Wang
Difan Zou
56
0
0
17 Feb 2025
Deciphering Functions of Neurons in Vision-Language Models
Jiaqi Xu
Cuiling Lan
Xuejin Chen
Yan Lu
VLM
84
0
0
10 Feb 2025
Interpretable Language Modeling via Induction-head Ngram Models
Eunji Kim
Sriya Mantena
Weiwei Yang
Chandan Singh
Sungroh Yoon
Jianfeng Gao
49
0
0
31 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Viacheslav Surkov
Chris Wendler
Mikhail Terekhov
Justin Deschenaux
Robert West
Çağlar Gülçehre
VLM
40
13
0
28 Oct 2024
Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework
Angela van Sprang
Erman Acar
Willem Zuidema
AI4TS
46
1
0
08 Oct 2024
System 2 Reasoning Capabilities Are Nigh
Scott C. Lowe
VLM
LRM
40
0
0
04 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii
Laida Kushnareva
Kristian Kuznetsov
Anastasia Voznyuk
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
67
1
0
03 Oct 2024
Locating and Editing Factual Associations in Mamba
Arnab Sen Sharma
David Atkinson
David Bau
KELM
68
28
0
04 Apr 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
43
42
0
01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
59
26
0
27 Feb 2024
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition
Yufei Huang
Shengding Hu
Xu Han
Zhiyuan Liu
Maosong Sun
62
14
0
23 Feb 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Philip Quirke
Clement Neo
Fazl Barez
KELM
LRM
38
3
0
04 Feb 2024
Universal Neurons in GPT2 Language Models
Wes Gurnee
Theo Horsley
Zifan Carl Guo
Tara Rezaei Kheirkhah
Qinyi Sun
Will Hathaway
Neel Nanda
Dimitris Bertsimas
MILM
94
37
0
22 Jan 2024
Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed
Can Rager
Arthur Conmy
62
55
0
16 Oct 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
155
186
0
02 May 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
191
261
0
28 Apr 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
283
2,232
0
22 Mar 2023
Crawling the Internal Knowledge-Base of Language Models
Roi Cohen
Mor Geva
Jonathan Berant
Amir Globerson
175
76
0
30 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
494
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova DasSarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
250
458
0
24 Sep 2022
Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango
Aman Madaan
Amir Yazdanbakhsh
LRM
144
116
0
16 Sep 2022
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktäschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
413
2,584
0
03 Sep 2019
What you can cram into a single vector: Probing sentence embeddings for linguistic properties
Alexis Conneau
Germán Kruszewski
Guillaume Lample
Loïc Barrault
Marco Baroni
199
882
0
03 May 2018
Towards A Rigorous Science of Interpretable Machine Learning
Finale Doshi-Velez
Been Kim
XAI
FaML
242
3,681
0
28 Feb 2017