2407.02646
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
2 July 2024
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
Papers citing
"A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models"
32 / 32 papers shown
LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades
Yanan Li
Fanxu Meng
Muhan Zhang
Shiai Zhu
Shangguang Wang
Mengwei Xu
MoMe
2
0
0
17 May 2025
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen
Jiaying Zhu
Xinyu Yang
Wenya Wang
LRM
14
0
0
15 May 2025
In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
Pranav Kantroo
Günter P. Wagner
Benjamin B. Machta
47
0
0
23 Apr 2025
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
Aviv Bick
Eric P. Xing
Albert Gu
RALM
91
1
0
22 Apr 2025
Layers at Similar Depths Generate Similar Activations Across LLM Architectures
Christopher Wolfram
Aaron Schein
34
1
0
03 Apr 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
65
0
0
23 Feb 2025
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Hantao Lou
Changye Li
Yalan Qin
Yaodong Yang
50
1
0
22 Feb 2025
An explainable transformer circuit for compositional generalization
Cheng Tang
Brenden Lake
Mehrdad Jazayeri
LRM
44
0
0
19 Feb 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
Xuben Wang
Yan Hu
Wenyu Du
Reynold Cheng
Benyou Wang
Difan Zou
61
1
0
17 Feb 2025
Deciphering Functions of Neurons in Vision-Language Models
Jiaqi Xu
Cuiling Lan
Xuejin Chen
Yan Lu
VLM
100
0
0
10 Feb 2025
Interpretable Language Modeling via Induction-head Ngram Models
Eunji Kim
Sriya Mantena
Weiwei Yang
Chandan Singh
Sungroh Yoon
Jianfeng Gao
65
0
0
31 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Viacheslav Surkov
Chris Wendler
Mikhail Terekhov
Justin Deschenaux
Robert West
Çağlar Gülçehre
VLM
40
13
0
28 Oct 2024
Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework
Angela van Sprang
Erman Acar
Willem Zuidema
AI4TS
51
1
0
08 Oct 2024
System 2 Reasoning Capabilities Are Nigh
Scott C. Lowe
VLM
LRM
51
0
0
04 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii
Laida Kushnareva
Kristian Kuznetsov
Anastasia Voznyuk
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
72
1
0
03 Oct 2024
Locating and Editing Factual Associations in Mamba
Arnab Sen Sharma
David Atkinson
David Bau
KELM
76
28
0
04 Apr 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
45
42
0
01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
59
28
0
27 Feb 2024
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition
Yufei Huang
Shengding Hu
Xu Han
Zhiyuan Liu
Maosong Sun
70
14
0
23 Feb 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Philip Quirke
Clement Neo
Fazl Barez
KELM
LRM
41
3
0
04 Feb 2024
Universal Neurons in GPT2 Language Models
Wes Gurnee
Theo Horsley
Zifan Carl Guo
Tara Rezaei Kheirkhah
Qinyi Sun
Will Hathaway
Neel Nanda
Dimitris Bertsimas
MILM
105
39
0
22 Jan 2024
Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed
Can Rager
Arthur Conmy
68
57
0
16 Oct 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
165
190
0
02 May 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
200
266
0
28 Apr 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
348
2,232
0
22 Mar 2023
Crawling the Internal Knowledge-Base of Language Models
Roi Cohen
Mor Geva
Jonathan Berant
Amir Globerson
186
77
0
30 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
507
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova DasSarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
250
463
0
24 Sep 2022
Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango
Aman Madaan
Amir Yazdanbakhsh
LRM
154
116
0
16 Sep 2022
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktäschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
449
2,589
0
03 Sep 2019
What you can cram into a single vector: Probing sentence embeddings for linguistic properties
Alexis Conneau
Germán Kruszewski
Guillaume Lample
Loïc Barrault
Marco Baroni
201
883
0
03 May 2018
Towards A Rigorous Science of Interpretable Machine Learning
Finale Doshi-Velez
Been Kim
XAI
FaML
257
3,690
0
28 Feb 2017