2305.01610
Cited By
v1
v2 (latest)
Finding Neurons in a Haystack: Case Studies with Sparse Probing
2 May 2023
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
Papers citing
"Finding Neurons in a Haystack: Case Studies with Sparse Probing"
50 / 60 papers shown
Title
ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Hao Chen
Haoze Li
Zhiqing Xiao
Lirong Gao
Qi Zhang
Xiaomeng Hu
Ningtao Wang
Xing Fu
Junbo Zhao
174
0
0
24 May 2025
Understanding Gated Neurons in Transformers from Their Input-Output Functionality
Sebastian Gerstner
Hinrich Schütze
MILM
FAtt
188
0
0
23 May 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu
Dong Gong
Erdun Gao
Zhen Zhang
Biwei Huang
Anton van den Hengel
Javen Qinfeng Shi
445
0
0
12 Mar 2025
Exploiting Edited Large Language Models as General Scientific Optimizers
Qitan Lv
T. Liu
Haoyu Wang
151
1
0
08 Mar 2025
Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu
Stephan Alaniz
Eric Schulz
Zeynep Akata
85
0
0
03 Feb 2025
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference
Go Kamoda
Benjamin Heinzerling
Tatsuro Inaba
Keito Kudo
Keisuke Sakaguchi
Kentaro Inui
MILM
91
3
0
27 Jan 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
99
2
0
09 Jan 2025
Improving Object Detection by Modifying Synthetic Data with Explainable AI
Nitish Mital
Simon Malzard
Richard Walters
Celso M. De Melo
Raghuveer Rao
Victoria Nockles
123
0
0
02 Dec 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu
Sophia Ananiadou
434
2
0
17 Nov 2024
Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes
Bryan R Christ
Zack Gottesman
Jonathan Kropko
Thomas Hartvigsen
LRM
112
4
0
22 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Sihang Li
Yongbin Li
129
9
0
17 Oct 2024
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning
Wei Chen
Zhen Huang
Liang Xie
Binbin Lin
Houqiang Li
...
Deng Cai
Yonggang Zhang
Wenxiao Wang
Xu Shen
Jieping Ye
108
9
0
03 Sep 2024
Knowledge in Superposition: Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models
Chenhui Hu
Pengfei Cao
Yubo Chen
Kang Liu
Jun Zhao
KELM
110
3
0
14 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
156
32
0
02 Jul 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
113
15
0
13 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park
Yo Joong Choe
Yibo Jiang
Victor Veitch
89
38
0
03 Jun 2024
A Multimodal Automated Interpretability Agent
Tamar Rott Shaham
Sarah Schwettmann
Franklin Wang
Achyuta Rajaram
Evan Hernandez
Jacob Andreas
Antonio Torralba
193
26
0
22 Apr 2024
Impossibility Theorems for Feature Attribution
Blair Bilodeau
Natasha Jaques
Pang Wei Koh
Been Kim
FAtt
61
76
0
22 Dec 2022
On the Relationship Between Explanation and Prediction: A Causal View
Amir-Hossein Karimi
Krikamol Muandet
Simon Kornblith
Bernhard Schölkopf
Been Kim
FAtt
CML
63
14
0
13 Dec 2022
Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt
136
375
0
07 Dec 2022
Interpreting Neural Networks through the Polytope Lens
Sid Black
Lee D. Sharkey
Léo Grinsztajn
Eric Winsor
Daniel A. Braun
...
Kip Parker
Carlos Ramón Guevara
Beren Millidge
Gabriel Alfour
Connor Leahy
FAtt
MILM
62
26
0
22 Nov 2022
Engineering Monosemanticity in Toy Models
Adam Jermyn
Nicholas Schiefer
Evan Hubinger
MILM
52
10
0
16 Nov 2022
Finding Skill Neurons in Pre-trained Transformer-based Language Models
Xiaozhi Wang
Kaiyue Wen
Zhengyan Zhang
Lei Hou
Zhiyuan Liu
Juanzi Li
MILM
MoE
56
51
0
14 Nov 2022
Polysemanticity and Capacity in Neural Networks
Adam Scherlis
Kshitij Sachan
Adam Jermyn
Joe Benton
Buck Shlegeris
MILM
174
30
0
04 Oct 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
316
516
0
24 Sep 2022
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
183
368
0
21 Sep 2022
Analyzing Transformers in Embedding Space
Guy Dar
Mor Geva
Ankit Gupta
Jonathan Berant
58
91
0
06 Sep 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
105
192
0
30 Aug 2022
Discovering Salient Neurons in Deep NLP Models
Nadir Durrani
Fahim Dalvi
Hassan Sajjad
KELM
MILM
71
16
0
27 Jun 2022
Is Power-Seeking AI an Existential Risk?
Joseph Carlsmith
ELM
62
87
0
16 Jun 2022
Emergent Abilities of Large Language Models
Jason W. Wei
Yi Tay
Rishi Bommasani
Colin Raffel
Barret Zoph
...
Tatsunori Hashimoto
Oriol Vinyals
Percy Liang
J. Dean
W. Fedus
ELM
ReLM
LRM
279
2,480
0
15 Jun 2022
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
498
6,240
0
05 Apr 2022
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
248
1,357
0
10 Feb 2022
Sparse Interventions in Language Models with Differentiable Masking
Nicola De Cao
Leon Schmid
Dieuwke Hupkes
Ivan Titov
63
28
0
13 Dec 2021
On the Pitfalls of Analyzing Individual Neurons in Language Models
Omer Antverg
Yonatan Belinkov
MILM
64
53
0
14 Oct 2021
Neuron-level Interpretation of Deep NLP Models: A Survey
Hassan Sajjad
Nadir Durrani
Fahim Dalvi
MILM
AI4CE
71
84
0
30 Aug 2021
Probing Across Time: What Does RoBERTa Know and When?
Leo Z. Liu
Yizhong Wang
Jungo Kasai
Hannaneh Hajishirzi
Noah A. Smith
KELM
81
85
0
16 Apr 2021
Low-Complexity Probing via Finding Subnetworks
Steven Cao
Victor Sanh
Alexander M. Rush
43
54
0
08 Apr 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
286
452
0
24 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
450
2,096
0
31 Dec 2020
Transformer Feed-Forward Layers Are Key-Value Memories
Mor Geva
R. Schuster
Jonathan Berant
Omer Levy
KELM
161
828
0
29 Dec 2020
Intrinsic Probing through Dimension Selection
Lucas Torroba Hennigen
Adina Williams
Ryan Cotterell
54
58
0
06 Oct 2020
Understanding the Role of Individual Units in a Deep Neural Network
David Bau
Jun-Yan Zhu
Hendrik Strobelt
Àgata Lapedriza
Bolei Zhou
Antonio Torralba
GAN
69
451
0
10 Sep 2020
Finding Experts in Transformer Models
Xavier Suau
Luca Zappella
N. Apostoloff
48
31
0
15 May 2020
Information-Theoretic Probing with Minimum Description Length
Elena Voita
Ivan Titov
85
275
0
27 Mar 2020
Designing and Interpreting Probes with Control Tasks
John Hewitt
Percy Liang
76
537
0
08 Sep 2019
What do you learn from context? Probing for sentence structure in contextualized word representations
Ian Tenney
Patrick Xia
Berlin Chen
Alex Jinpeng Wang
Adam Poliak
...
Najoung Kim
Benjamin Van Durme
Samuel R. Bowman
Dipanjan Das
Ellie Pavlick
180
861
0
15 May 2019
BERT Rediscovers the Classical NLP Pipeline
Ian Tenney
Dipanjan Das
Ellie Pavlick
MILM
SSeg
138
1,476
0
15 May 2019
What Is One Grain of Sand in the Desert? Analyzing Individual Neurons in Deep NLP Models
Fahim Dalvi
Nadir Durrani
Hassan Sajjad
Yonatan Belinkov
A. Bau
James R. Glass
MILM
64
191
0
21 Dec 2018
Sanity Checks for Saliency Maps
Julius Adebayo
Justin Gilmer
M. Muelly
Ian Goodfellow
Moritz Hardt
Been Kim
FAtt
AAML
XAI
139
1,967
0
08 Oct 2018