2211.00593
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
1 November 2022
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
Papers citing
"Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small"
50 / 125 papers shown
Title
Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation
Reilly Haskins
Benjamin Adams
16
0
0
16 May 2025
Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning
Jingcheng Niu
Subhabrata Dutta
Ahmed Elshabrawy
Harish Tayyar Madabushi
Iryna Gurevych
31
0
0
16 May 2025
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen
Jiaying Zhu
Xinyu Yang
Wenya Wang
LRM
19
0
0
15 May 2025
Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
Jingcheng Niu
Xingdi Yuan
Tong Wang
Hamidreza Saghir
Amir H. Abdi
27
0
0
14 May 2025
Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
Leon Eshuijs
Shihan Wang
Antske Fokkens
31
0
0
09 May 2025
Understanding In-context Learning of Addition via Activation Subspaces
Xinyan Hu
Kayo Yin
Michael I. Jordan
Jacob Steinhardt
Lijie Chen
56
0
0
08 May 2025
Geospatial Mechanistic Interpretability of Large Language Models
Stef De Sabbata
Stefano Mizzaro
Kevin Roitero
AI4CE
37
0
0
06 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
88
1
0
02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde
Louis Jaburi
MILM
90
1
0
01 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He
Jie Wang
Rui Lin
Xuyang Ge
Wentao Shu
Qiong Tang
J.N. Zhang
Xipeng Qiu
70
0
0
29 Apr 2025
Model Connectomes: A Generational Approach to Data-Efficient Language Models
Klemen Kotar
Greta Tuckute
60
0
0
29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Sonia Joseph
Praneet Suresh
Lorenz Hufe
Edward Stevinson
Robert Graham
Yash Vadi
Danilo Bzdok
Sebastian Lapuschkin
Lee Sharkey
Blake A. Richards
72
0
0
28 Apr 2025
Improving Reasoning Performance in Large Language Models via Representation Engineering
Bertram Højer
Oliver Jarvis
Stefan Heinrich
LRM
83
2
0
28 Apr 2025
Structural Inference: Interpreting Small Language Models with Susceptibilities
Garrett Baker
George Wang
Jesse Hoogland
Daniel Murfet
AAML
81
1
0
25 Apr 2025
Do Large Language Models know who did what to whom?
Joseph M. Denning
Xiaohan
Bryor Snefjella
Idan A. Blank
67
1
0
23 Apr 2025
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Tyler A. Chang
Benjamin Bergen
55
0
0
21 Apr 2025
Understanding the Repeat Curse in Large Language Models from a Feature Perspective
Junchi Yao
Shu Yang
Jianhua Xu
Lijie Hu
Mengdi Li
Di Wang
29
0
0
19 Apr 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
51
1
0
17 Apr 2025
Steering off Course: Reliability Challenges in Steering Language Models
Patrick Queiroz Da Silva
Hari Sethuraman
Dheeraj Rajagopal
Hannaneh Hajishirzi
Sachin Kumar
LLMSV
39
1
0
06 Apr 2025
Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
Aleksandra Bakalova
Yana Veitsman
Xinting Huang
Michael Hahn
36
0
0
31 Mar 2025
Are formal and functional linguistic mechanisms dissociated in language models?
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
52
0
0
14 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Tianhe Lin
Jian Xie
Siyu Yuan
Deqing Yang
ReLM
LRM
75
2
0
10 Mar 2025
Exploiting Edited Large Language Models as General Scientific Optimizers
Qitan Lv
T. Liu
Haoyu Wang
46
0
0
08 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
68
0
0
08 Mar 2025
How can representation dimension dominate structurally pruned LLMs?
Mingxue Xu
Lisa Alazraki
Danilo Mandic
56
0
0
06 Mar 2025
(How) Do Language Models Track State?
Belinda Z. Li
Zifan Carl Guo
Jacob Andreas
LRM
52
1
0
04 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Jonathan Jacobi
Gal Niv
LRM
ReLM
65
0
0
03 Mar 2025
Re-evaluating Theory of Mind evaluation in large language models
Jennifer Hu
Felix Sosa
T. Ullman
45
1
0
28 Feb 2025
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking
Yifan Zhang
Wenyu Du
Dongming Jin
Jie Fu
Zhi Jin
LRM
53
0
0
27 Feb 2025
Quantifying Logical Consistency in Transformers via Query-Key Alignment
Eduard Tulchinskii
Anastasia Voznyuk
Laida Kushnareva
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
LRM
68
0
0
24 Feb 2025
Model Lakes
Koyena Pal
David Bau
Renée J. Miller
67
0
0
24 Feb 2025
Revealing and Mitigating Over-Attention in Knowledge Editing
Pinzheng Wang
Zecheng Tang
Keyan Zhou
J. Li
Qiaoming Zhu
Mengdi Zhang
KELM
124
2
0
21 Feb 2025
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Anton Razzhigaev
Matvey Mikhalchuk
Temurbek Rahmatullaev
Elizaveta Goncharova
Polina Druzhinina
Ivan Oseledets
Andrey Kuznetsov
69
3
0
20 Feb 2025
A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management
Simeon Campos
Henry Papadatos
Fabien Roger
Chloé Touzet
Malcolm Murray
Otter Quarks
100
2
0
20 Feb 2025
TinyEmo: Scaling down Emotional Reasoning via Metric Projection
Cristian Gutierrez
LRM
69
0
0
17 Feb 2025
Exploring Translation Mechanism of Large Language Models
Hongbin Zhang
Kehai Chen
Xuefeng Bai
Xiucheng Li
Yang Xiang
Min Zhang
67
1
0
17 Feb 2025
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution
Shichang Zhang
Tessa Han
Usha Bhalla
Hima Lakkaraju
FAtt
157
0
0
17 Feb 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning
Lefei Zhang
Lijie Hu
Di Wang
LRM
100
0
0
17 Feb 2025
Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning
Gangwei Jiang
Caigao Jiang
Zhaoyi Li
Siqiao Xue
Jun-ping Zhou
Linqi Song
Defu Lian
Yin Wei
CLL
MU
63
1
0
16 Feb 2025
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
Da Xiao
Qingye Meng
Shengping Li
Xingyuan Yuan
MoE
AI4CE
68
1
0
13 Feb 2025
Mechanistic Interpretability of Emotion Inference in Large Language Models
Ala Nekouvaght Tak
Amin Banayeeanzade
Anahita Bolourani
Mina Kian
Robin Jia
Jonathan Gratch
54
0
0
08 Feb 2025
Constrained belief updates explain geometric structures in transformer representations
Mateusz Piotrowski
P. Riechers
Daniel Filan
A. Shai
76
0
0
04 Feb 2025
Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu
Stephan Alaniz
Eric Schulz
Zeynep Akata
47
0
0
03 Feb 2025
It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis
Sengim Karayalçin
Marina Krček
Stjepan Picek
AAML
80
0
0
01 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
53
1
0
09 Jan 2025
Out-of-distribution generalization via composition: a lens through induction heads in Transformers
Jiajun Song
Zhuoyan Xu
Yiqiao Zhong
88
4
0
31 Dec 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
85
16
0
21 Nov 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu
Sophia Ananiadou
205
0
0
17 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Peng Kuang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
62
3
0
17 Nov 2024
More Expressive Attention with Negative Weights
Ang Lv
Ruobing Xie
Shuaipeng Li
Jiayi Liao
Xingchen Sun
Zhanhui Kang
Di Wang
Rui Yan
42
0
0
11 Nov 2024