Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2309.16042
Cited By
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
27 September 2023
Fred Zhang
Neel Nanda
LLMSV
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"
37 / 87 papers shown
Title
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
Somnath Banerjee
Soham Tripathy
Sayan Layek
Shanu Kumar
Animesh Mukherjee
Rima Hazra
25
1
0
18 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits
Jacob Dunefsky
Philippe Chlenski
Neel Nanda
27
23
0
17 Jun 2024
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner
Shreyas Kapur
Vasil Georgiev
Cameron Allen
Scott Emmons
Stuart J. Russell
32
10
0
02 Jun 2024
Exploring and steering the moral compass of Large Language Models
Alejandro Tlaie
LLMSV
32
3
0
27 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong
Yi Cheng
Kaishuai Xu
Jian Wang
Hanlin Wang
Wenjie Li
AAML
51
17
0
25 May 2024
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Boshi Wang
Xiang Yue
Yu-Chuan Su
Huan Sun
LRM
29
41
0
23 May 2024
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles OÑeill
Thang Bui
40
5
0
21 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward
Raphael Milliere
Cameron Buckner
LRM
66
13
0
06 May 2024
How to use and interpret activation patching
Stefan Heimersheim
Neel Nanda
35
37
0
23 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
40
114
0
22 Apr 2024
Finding Visual Task Vectors
Alberto Hojel
Yutong Bai
Trevor Darrell
Amir Globerson
Amir Bar
64
6
0
08 Apr 2024
Locating and Editing Factual Associations in Mamba
Arnab Sen Sharma
David Atkinson
David Bau
KELM
73
28
0
04 Apr 2024
Unveiling LLMs: The Evolution of Latent Representations in a Temporal Knowledge Graph
Marco Bronzini
Carlo Nicolini
Bruno Lepri
Jacopo Staiano
Andrea Passerini
KELM
28
5
0
04 Apr 2024
On Large Language Models' Hallucination with Regard to Known Facts
Che Jiang
Biqing Qi
Xiangyu Hong
Dayuan Fu
Yang Cheng
Fandong Meng
Mo Yu
Bowen Zhou
Jie Zhou
HILM
LRM
31
16
0
29 Mar 2024
Localizing Paragraph Memorization in Language Models
Niklas Stoehr
Mitchell Gordon
Chiyuan Zhang
Owen Lewis
MU
38
13
0
28 Mar 2024
Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models
Ang Lv
Yuhan Chen
Kaiyi Zhang
Yulong Wang
Lifeng Liu
Ji-Rong Wen
Jian Xie
Rui Yan
KELM
32
16
0
28 Mar 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
51
34
0
26 Mar 2024
Monotonic Representation of Numeric Properties in Language Models
Benjamin Heinzerling
Kentaro Inui
KELM
MILM
45
9
0
15 Mar 2024
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
Michael Toker
Hadas Orgad
Mor Ventura
Dana Arad
Yonatan Belinkov
DiffM
64
12
0
09 Mar 2024
The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models
Adithya Bhaskar
Dan Friedman
Danqi Chen
35
5
0
06 Mar 2024
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
Subhabrata Dutta
Joykirat Singh
Soumen Chakrabarti
Tanmoy Chakraborty
LRM
43
23
0
28 Feb 2024
Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models
Zhuoran Jin
Pengfei Cao
Hongbang Yuan
Yubo Chen
Jiexin Xu
Huaijun Li
Xiaojian Jiang
Kang Liu
Jun Zhao
183
36
0
28 Feb 2024
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
Zhengfu He
Xuyang Ge
Qiong Tang
Tianxiang Sun
Qinyuan Cheng
Xipeng Qiu
39
20
0
19 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models
Goutham Rajendran
Simon Buchholz
Bryon Aragam
Bernhard Schölkopf
Pradeep Ravikumar
AI4CE
91
21
0
14 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun
Avi Caciularu
Adam Pearce
Lucas Dixon
Mor Geva
34
87
0
11 Jan 2024
Neuron-Level Knowledge Attribution in Large Language Models
Zeping Yu
Sophia Ananiadou
FAtt
KELM
24
7
0
19 Dec 2023
Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Tony T. Wang
Miles Wang
Kaivu Hariharan
Nir Shavit
21
2
0
14 Dec 2023
An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao
Yeu-Tong Lau
Can Rager
Jett Janiak
35
5
0
11 Oct 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
160
186
0
02 May 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna
Ollie Liu
Alexandre Variengien
LRM
189
120
0
30 Apr 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
191
261
0
28 Apr 2023
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
Yuchen Li
Yuan-Fang Li
Andrej Risteski
120
61
0
07 Mar 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger
Zhengxuan Wu
Christopher Potts
Thomas F. Icard
Noah D. Goodman
CML
75
98
0
05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
496
0
01 Nov 2022
Polysemanticity and Capacity in Neural Networks
Adam Scherlis
Kshitij Sachan
Adam Jermyn
Joe Benton
Buck Shlegeris
MILM
135
25
0
04 Oct 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
250
460
0
24 Sep 2022
Natural Language Descriptions of Deep Visual Features
Evan Hernandez
Sarah Schwettmann
David Bau
Teona Bagashvili
Antonio Torralba
Jacob Andreas
MILM
201
117
0
26 Jan 2022
Previous
1
2