Finding Neurons in a Haystack: Case Studies with Sparse Probing

2 May 2023

Papers citing "Finding Neurons in a Haystack: Case Studies with Sparse Probing"

ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models Hao Chen Haoze Li Zhiqing Xiao Lirong Gao Qi Zhang Xiaomeng Hu Ningtao Wang Xing Fu Junbo Zhao 174 0 0 24 May 2025
Understanding Gated Neurons in Transformers from Their Input-Output Functionality Sebastian Gerstner Hinrich Schütze MILM FAtt 188 0 0 23 May 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Yuhang Liu Dong Gong Erdun Gao Zhen Zhang Biwei Huang Anton van den Hengel Javen Qinfeng Shi 445 0 0 12 Mar 2025
Exploiting Edited Large Language Models as General Scientific Optimizers Qitan Lv T. Liu Haoyu Wang 151 1 0 08 Mar 2025
Discovering Chunks in Neural Embeddings for Interpretability Shuchen Wu Stephan Alaniz Eric Schulz Zeynep Akata 85 0 0 03 Feb 2025
Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference Go Kamoda Benjamin Heinzerling Tatsuro Inaba Keito Kudo Keisuke Sakaguchi Kentaro Inui MILM 91 3 0 27 Jan 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 99 2 0 09 Jan 2025
Improving Object Detection by Modifying Synthetic Data with Explainable AI Nitish Mital Simon Malzard Richard Walters Celso M. De Melo Raghuveer Rao Victoria Nockles 123 0 0 02 Dec 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering Zeping Yu Sophia Ananiadou 434 2 0 17 Nov 2024
Math Neurosurgery: Isolating Language Models' Math Reasoning Abilities Using Only Forward Passes Bryan R Christ Zack Gottesman Jonathan Kropko Thomas Hartvigsen LRM 112 4 0 22 Oct 2024
On the Role of Attention Heads in Large Language Model Safety Zhenhong Zhou Haiyang Yu Xinghua Zhang Rongwu Xu Fei Huang Kun Wang Yang Liu Sihang Li Yongbin Li 129 9 0 17 Oct 2024
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning Wei Chen Zhen Huang Liang Xie Binbin Lin Houqiang Li ... Deng Cai Yonggang Zhang Wenxiao Wang Xu Shen Jieping Ye 108 9 0 03 Sep 2024
Knowledge in Superposition: Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models Chenhui Hu Pengfei Cao Yubo Chen Kang Liu Jun Zhao KELM 110 3 0 14 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 156 32 0 02 Jul 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models Jack Merullo Carsten Eickhoff Ellie Pavlick 113 15 0 13 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models Kiho Park Yo Joong Choe Yibo Jiang Victor Veitch 89 38 0 03 Jun 2024
A Multimodal Automated Interpretability Agent Tamar Rott Shaham Sarah Schwettmann Franklin Wang Achyuta Rajaram Evan Hernandez Jacob Andreas Antonio Torralba 193 26 0 22 Apr 2024
Impossibility Theorems for Feature Attribution Blair Bilodeau Natasha Jaques Pang Wei Koh Been Kim FAtt 61 76 0 22 Dec 2022
On the Relationship Between Explanation and Prediction: A Causal View Amir-Hossein Karimi Krikamol Muandet Simon Kornblith Bernhard Schölkopf Been Kim FAtt CML 63 14 0 13 Dec 2022
Discovering Latent Knowledge in Language Models Without Supervision Collin Burns Haotian Ye Dan Klein Jacob Steinhardt 136 375 0 07 Dec 2022
Interpreting Neural Networks through the Polytope Lens Sid Black Lee D. Sharkey Léo Grinsztajn Eric Winsor Daniel A. Braun ... Kip Parker Carlos Ramón Guevara Beren Millidge Gabriel Alfour Connor Leahy FAtt MILM 62 26 0 22 Nov 2022
Engineering Monosemanticity in Toy Models Adam Jermyn Nicholas Schiefer Evan Hubinger MILM 52 10 0 16 Nov 2022
Finding Skill Neurons in Pre-trained Transformer-based Language Models Xiaozhi Wang Kaiyue Wen Zhengyan Zhang Lei Hou Zhiyuan Liu Juanzi Li MILM MoE 56 51 0 14 Nov 2022
Polysemanticity and Capacity in Neural Networks Adam Scherlis Kshitij Sachan Adam Jermyn Joe Benton Buck Shlegeris MILM 174 30 0 04 Oct 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 316 516 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 183 368 0 21 Sep 2022
Analyzing Transformers in Embedding Space Guy Dar Mor Geva Ankit Gupta Jonathan Berant 58 91 0 06 Sep 2022
The Alignment Problem from a Deep Learning Perspective Richard Ngo Lawrence Chan Sören Mindermann 105 192 0 30 Aug 2022
Discovering Salient Neurons in Deep NLP Models Nadir Durrani Fahim Dalvi Hassan Sajjad KELM MILM 71 16 0 27 Jun 2022
Is Power-Seeking AI an Existential Risk? Joseph Carlsmith ELM 62 87 0 16 Jun 2022
Emergent Abilities of Large Language Models Jason W. Wei Yi Tay Rishi Bommasani Colin Raffel Barret Zoph ... Tatsunori Hashimoto Oriol Vinyals Percy Liang J. Dean W. Fedus ELM ReLM LRM 279 2,480 0 15 Jun 2022
PaLM: Scaling Language Modeling with Pathways Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra ... Kathy Meier-Hellstern Douglas Eck J. Dean Slav Petrov Noah Fiedel PILM LRM 498 6,240 0 05 Apr 2022
Locating and Editing Factual Associations in GPT Kevin Meng David Bau A. Andonian Yonatan Belinkov KELM 248 1,357 0 10 Feb 2022
Sparse Interventions in Language Models with Differentiable Masking Nicola De Cao Leon Schmid Dieuwke Hupkes Ivan Titov 63 28 0 13 Dec 2021
On the Pitfalls of Analyzing Individual Neurons in Language Models Omer Antverg Yonatan Belinkov MILM 64 53 0 14 Oct 2021
Neuron-level Interpretation of Deep NLP Models: A Survey Hassan Sajjad Nadir Durrani Fahim Dalvi MILM AI4CE 71 84 0 30 Aug 2021
Probing Across Time: What Does RoBERTa Know and When? Leo Z. Liu Yizhong Wang Jungo Kasai Hannaneh Hajishirzi Noah A. Smith KELM 81 85 0 16 Apr 2021
Low-Complexity Probing via Finding Subnetworks Steven Cao Victor Sanh Alexander M. Rush 43 54 0 08 Apr 2021
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 286 452 0 24 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 450 2,096 0 31 Dec 2020
Transformer Feed-Forward Layers Are Key-Value Memories Mor Geva R. Schuster Jonathan Berant Omer Levy KELM 161 828 0 29 Dec 2020
Intrinsic Probing through Dimension Selection Lucas Torroba Hennigen Adina Williams Ryan Cotterell 54 58 0 06 Oct 2020
Understanding the Role of Individual Units in a Deep Neural Network David Bau Jun-Yan Zhu Hendrik Strobelt Àgata Lapedriza Bolei Zhou Antonio Torralba GAN 69 451 0 10 Sep 2020
Finding Experts in Transformer Models Xavier Suau Luca Zappella N. Apostoloff 48 31 0 15 May 2020
Information-Theoretic Probing with Minimum Description Length Elena Voita Ivan Titov 85 275 0 27 Mar 2020
Designing and Interpreting Probes with Control Tasks John Hewitt Percy Liang 76 537 0 08 Sep 2019
What do you learn from context? Probing for sentence structure in contextualized word representations Ian Tenney Patrick Xia Berlin Chen Alex Jinpeng Wang Adam Poliak ... Najoung Kim Benjamin Van Durme Samuel R. Bowman Dipanjan Das Ellie Pavlick 180 861 0 15 May 2019
BERT Rediscovers the Classical NLP Pipeline Ian Tenney Dipanjan Das Ellie Pavlick MILM SSeg 138 1,476 0 15 May 2019
What Is One Grain of Sand in the Desert? Analyzing Individual Neurons in Deep NLP Models Fahim Dalvi Nadir Durrani Hassan Sajjad Yonatan Belinkov A. Bau James R. Glass MILM 64 191 0 21 Dec 2018
Sanity Checks for Saliency Maps Julius Adebayo Justin Gilmer M. Muelly Ian Goodfellow Moritz Hardt Been Kim FAtt AAML XAI 139 1,967 0 08 Oct 2018