Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

1 November 2022

Papers citing "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small"

50 / 128 papers shown

Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering Zeping Yu Sophia Ananiadou 234 0 0 17 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit Zeqing He Peng Kuang Zhixuan Chu Huiyu Xu Rui Zheng Kui Ren Chun Chen 62 5 0 17 Nov 2024
More Expressive Attention with Negative Weights Ang Lv Ruobing Xie Shuaipeng Li Jiayi Liao Xingwu Sun Zhanhui Kang Di Wang Rui Yan 44 0 0 11 Nov 2024
Controllable Context Sensitivity and the Knob Behind It Julian Minder Kevin Du Niklas Stoehr Giovanni Monea Chris Wendler Robert West Ryan Cotterell KELM 63 4 0 11 Nov 2024
Focus On This, Not That! Steering LLMs With Adaptive Feature Specification Tom A. Lamb Adam Davies Alasdair Paren Philip Torr Francesco Pinto 57 0 0 30 Oct 2024
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics Yaniv Nikankin Anja Reusch Aaron Mueller Yonatan Belinkov AIFin LRM 46 25 0 28 Oct 2024
LLMScan: Causal Scan for LLM Misbehavior Detection Mengdi Zhang Kai Kiat Goh Peixin Zhang Jun Sun Rose Lin Xin Hongyu Zhang 28 0 0 22 Oct 2024
Identifying Sub-networks in Neural Networks via Functionally Similar Representations Tian Gao Amit Dhurandhar Karthikeyan N. Ramamurthy Dennis L. Wei 56 0 0 21 Oct 2024
Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models Qitan Lv Jie Wang Hanzhu Chen Bin Li Yongdong Zhang Feng Wu HILM 35 3 0 19 Oct 2024
On the Role of Attention Heads in Large Language Model Safety Zhenhong Zhou Haiyang Yu Xinghua Zhang Rongwu Xu Fei Huang Kun Wang Yang Liu Sihang Li Yongbin Li 59 5 0 17 Oct 2024
Analyzing (In)Abilities of SAEs via Formal Languages Abhinav Menon Manish Shrivastava David M. Krueger Ekdeep Singh Lubana 50 7 0 15 Oct 2024
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts Guorui Zheng Xidong Wang Juhao Liang Nuo Chen Yuping Zheng Benyou Wang MoE 40 5 0 14 Oct 2024
Unlearning-based Neural Interpretations Ching Lam Choi Alexandre Duplessis Serge Belongie FAtt 54 0 0 10 Oct 2024
Towards Interpreting Visual Information Processing in Vision-Language Models Clement Neo Luke Ong Philip Torr Mor Geva David M. Krueger Fazl Barez 92 6 0 09 Oct 2024
Round and Round We Go! What makes Rotary Positional Encodings useful? Federico Barbero Alex Vitvitskyi Christos Perivolaropoulos Razvan Pascanu Petar Veličković 85 19 0 08 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models Michael A. Lepori Michael Mozer Asma Ghandeharioun LRM 87 1 0 02 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models Philipp Mondorf Sondre Wold Yun Xue 43 0 0 02 Oct 2024
Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis Zeping Yu Sophia Ananiadou LRM MILM 32 7 0 21 Sep 2024
Extracting Paragraphs from LLM Token Activations Nicholas Pochinkov Angelo Benoit Lovkush Agarwal Zainab Ali Majid Lucile Ter-Minassian 32 1 0 10 Sep 2024
Residual Stream Analysis with Multi-Layer SAEs Tim Lawson Lucy Farnik Conor Houghton Laurence Aitchison 39 3 0 06 Sep 2024
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning Wei Chen Zhen Huang Liang Xie Binbin Lin Houqiang Li ... Deng Cai Yonggang Zhang Wenxiao Wang Xu Shen Jieping Ye 57 6 0 03 Sep 2024
A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models Geonhee Kim Marco Valentino André Freitas LRM AI4CE 38 8 0 16 Aug 2024
Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data Mingshu Li 46 3 0 01 Aug 2024
On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs Nitay Calderon Roi Reichart 47 13 0 27 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust Joseph Miller Bilal Chughtai William Saunders 58 7 0 11 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 85 22 0 02 Jul 2024
Finding Transformer Circuits with Edge Pruning Adithya Bhaskar Alexander Wettig Dan Friedman Danqi Chen 68 17 0 24 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation Michal Golovanevsky William Rudman Vedant Palit Ritambhara Singh Carsten Eickhoff 35 1 0 24 Jun 2024
Unveiling LLM Mechanisms Through Neural ODEs and Control Theory Yukun Zhang Qi Dong 40 0 0 23 Jun 2024
Perception of Phonological Assimilation by Neural Speech Recognition Models Charlotte Pouw Marianne de Heer Kloots Afra Alishahi Willem H. Zuidema 49 2 0 21 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models Jack Merullo Carsten Eickhoff Ellie Pavlick 58 13 0 13 Jun 2024
Understanding Information Storage and Transfer in Multi-modal Large Language Models Samyadeep Basu Martin Grayson C. Morrison Besmira Nushi S. Feizi Daniela Massiceti 30 10 0 06 Jun 2024
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits Andis Draguns Andrew Gritsevskiy S. Motwani Charlie Rogers-Smith Jeffrey Ladish Christian Schroeder de Witt 48 2 0 03 Jun 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 74 20 0 28 May 2024
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks Jacob Russin Sam Whitman McGrath Danielle J. Williams Lotem Elber-Dorozko AI4CE 88 3 0 24 May 2024
How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator Subhash Kantamneni Ziming Liu Max Tegmark 19 2 0 23 May 2024
What does the Knowledge Neuron Thesis Have to do with Knowledge? Jingcheng Niu Andrew Liu Zining Zhu Gerald Penn 50 31 0 03 May 2024
KAN: Kolmogorov-Arnold Networks Ziming Liu Yixuan Wang Sachin Vaidya Fabian Ruehle James Halverson Marin Soljačić Thomas Y. Hou Max Tegmark 100 487 0 30 Apr 2024
Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs Valeriia Cherepanova James Zou AAML 35 4 0 26 Apr 2024
Does Transformer Interpretability Transfer to RNNs? Gonçalo Paulo Thomas Marshall Nora Belrose 65 6 0 09 Apr 2024
Eigenpruning: an Interpretability-Inspired PEFT Method Tomás Vergara-Browne Álvaro Soto A. Aizawa 44 1 0 04 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 53 122 0 28 Mar 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models Frederik Kunstner Robin Yadav Alan Milligan Mark Schmidt Alberto Bietti 49 26 0 29 Feb 2024
On the Societal Impact of Open Foundation Models Sayash Kapoor Rishi Bommasani Kevin Klyman Shayne Longpre Ashwin Ramaswami ... Victor Storchan Daniel Zhang Daniel E. Ho Percy Liang Arvind Narayanan 31 54 0 27 Feb 2024
A Language Model's Guide Through Latent Space Dimitri von Rütte Sotiris Anagnostidis Gregor Bachmann Thomas Hofmann 45 24 0 22 Feb 2024
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems Zhiyuan Li Hong Liu Denny Zhou Tengyu Ma LRM AI4CE 30 101 0 20 Feb 2024
Opening the AI black box: program synthesis via mechanistic interpretability Eric J. Michaud Isaac Liao Vedang Lad Ziming Liu Anish Mudide Chloe Loughridge Zifan Carl Guo Tara Rezaei Kheirkhah Mateja Vukelić Max Tegmark 28 12 0 07 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 39 90 0 11 Jan 2024
ALMANACS: A Simulatability Benchmark for Language Model Explainability Edmund Mills Shiye Su Stuart J. Russell Scott Emmons 56 7 0 20 Dec 2023
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia Giovanni Monea Maxime Peyrard Martin Josifoski Vishrav Chaudhary Jason Eisner Emre Kiciman Hamid Palangi Barun Patra Robert West KELM 56 12 0 04 Dec 2023