Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

31 July 2024

Papers citing "Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models"

26 / 26 papers shown

Sparsification and Reconstruction from the Perspective of Representation Geometry Wenjie Sun Bingzhe Wu Zhile Yang Chengke Wu 79 0 0 28 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models Patrick Leask Neel Nanda Noura Al Moubayed 94 1 0 23 May 2025
Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models Zihao Li Xu Wang Yuzhe Yang Ziyu Yao Haoyi Xiong Jundong Li LLMSV LRM 129 3 0 21 May 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations Aaron Jiaxun Li Suraj Srinivas Usha Bhalla Himabindu Lakkaraju AAML 165 0 0 21 May 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders Bart Bussmann Noa Nabeshima Adam Karvonen Neel Nanda 129 13 0 21 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Adam Karvonen 90 0 0 21 Mar 2025
Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms Xiaojian Li Yongkang Leng Ruiqing Ding Hangjie Mo Shanlin Yang LRM 80 1 0 15 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Yuhang Liu Dong Gong Erdun Gao Zhen Zhang Biwei Huang Anton van den Hengel Javen Qinfeng Shi 465 1 0 12 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Adam Karvonen Can Rager Johnny Lin Curt Tigges Joseph Isaac Bloom ... Matthew Wearden Arthur Conmy Samuel Marks Neel Nanda MU 181 23 0 12 Mar 2025
Mixture of Experts Made Intrinsically Interpretable Xingyi Yang Constantin Venhoff Ashkan Khakzar Christian Schroeder de Witt P. Dokania Adel Bibi Philip Torr MoE 125 1 0 05 Mar 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing Subhash Kantamneni Joshua Engels Senthooran Rajamanoharan Max Tegmark Neel Nanda 149 17 0 23 Feb 2025
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models Thomas Fel Ekdeep Singh Lubana Jacob S. Prince M. Kowal Victor Boutin Isabel Papadimitriou Binxu Wang Martin Wattenberg Demba Ba Talia Konkle 81 8 0 18 Feb 2025
The Complexity of Learning Sparse Superposed Features with Feedback Akash Kumar 482 0 0 08 Feb 2025
Can Input Attributions Explain Inductive Reasoning in In-Context Learning? Mengyu Ye Tatsuki Kuribayashi Goro Kobayashi Jun Suzuki LRM 172 0 0 20 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks Alex F Spies William Edwards Michael Ivanitskiy Adrians Skapars Tilman Rauker Katsumi Inoue A. Russo Murray Shanahan 442 1 0 16 Dec 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 121 11 0 07 Nov 2024
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders Kola Ayonrinde 107 5 0 04 Nov 2024
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups Davide Ghilardi Federico Belotti Marco Molinari 82 6 0 28 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 109 16 0 18 Oct 2024
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders Constantin Venhoff Anisoara Calinescu Philip Torr Christian Schroeder de Witt 74 0 0 09 Oct 2024
Mechanistic? Naomi Saphra Sarah Wiegreffe AI4CE 80 13 0 07 Oct 2024
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders David Chanin James Wilken-Smith Tomáš Dulka Hardik Bhatnagar Joseph Bloom 130 37 0 22 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 Tom Lieberum Senthooran Rajamanoharan Arthur Conmy Lewis Smith Nicolas Sonnerat Vikrant Varma János Kramár Anca Dragan Rohin Shah Neel Nanda 124 128 0 09 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 132 25 0 02 Aug 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective Meng Wang Yunzhi Yao Ziwen Xu Shuofei Qiao Shumin Deng ... Yong Jiang Pengjun Xie Fei Huang Huajun Chen Ningyu Zhang 145 39 0 22 Jul 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 182 159 0 28 Mar 2024