Towards Unifying Interpretability and Control: Evaluation via Intervention

7 November 2024

Papers citing "Towards Unifying Interpretability and Control: Evaluation via Intervention"

Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations Aaron Jiaxun Li Suraj Srinivas Usha Bhalla Himabindu Lakkaraju AAML 92 0 0 21 May 2025
Steering off Course: Reliability Challenges in Steering Language Models Patrick Queiroz Da Silva Hari Sethuraman Dheeraj Rajagopal Hannaneh Hajishirzi Sachin Kumar LLMSV 63 1 0 06 Apr 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry Sai Sumedh R. Hindupur Ekdeep Singh Lubana Thomas Fel Demba Ba 70 6 0 03 Mar 2025
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models Thomas Fel Ekdeep Singh Lubana Jacob S. Prince M. Kowal Victor Boutin Isabel Papadimitriou Binxu Wang Martin Wattenberg Demba Ba Talia Konkle 51 3 0 18 Feb 2025
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment Harrish Thasarathan Julian Forsyth Thomas Fel M. Kowal Konstantinos G. Derpanis 122 9 0 06 Feb 2025
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 Tom Lieberum Senthooran Rajamanoharan Arthur Conmy Lewis Smith Nicolas Sonnerat Vikrant Varma János Kramár Anca Dragan Rohin Shah Neel Nanda 64 106 0 09 Aug 2024
Gemma 2: Improving Open Language Models at a Practical Size Gemma Team Morgane Riviere Shreya Pathak Pier Giuseppe Sessa Cassidy Hardin ... Noah Fiedel Armand Joulin Kathleen Kenealy Robert Dadashi Alek Andreev VLM MoE OSLM 84 772 0 31 Jul 2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Adam Karvonen Benjamin Wright Can Rager Rico Angell Jannik Brinkmann Logan Smith C. M. Verdun David Bau Samuel Marks 48 30 0 31 Jul 2024
Relational Composition in Neural Networks: A Survey and Call to Action Martin Wattenberg Fernanda Viégas CoGe 65 9 0 19 Jul 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders Senthooran Rajamanoharan Tom Lieberum Nicolas Sonnerat Arthur Conmy Vikrant Varma János Kramár Neel Nanda 43 87 0 19 Jul 2024
Who's asking? User personas and the mechanics of latent misalignment Asma Ghandeharioun Ann Yuan Marius Guerard Emily Reif Michael A. Lepori Lucas Dixon LLMSV 64 8 0 17 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky Philippe Chlenski Neel Nanda 56 30 0 17 Jun 2024
Designing a Dashboard for Transparency and Control of Conversational AI Yida Chen Aoyu Wu Trevor DePodesta Catherine Yeh Kenneth Li ... Jan Riecke Shivam Raval Olivia Seow Martin Wattenberg Fernanda Viégas 75 17 0 12 Jun 2024
Scaling and evaluating sparse autoencoders Leo Gao Tom Dupré la Tour Henk Tillman Gabriel Goh Rajan Troll Alec Radford Ilya Sutskever Jan Leike Jeffrey Wu 62 134 0 06 Jun 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control Aleksandar Makelov Georg Lange Neel Nanda 40 38 0 14 May 2024
Improving Dictionary Learning with Gated Sparse Autoencoders Senthooran Rajamanoharan Arthur Conmy Lewis Smith Tom Lieberum Vikrant Varma János Kramár Rohin Shah Neel Nanda RALM 47 86 0 24 Apr 2024
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) Usha Bhalla Alexander X. Oesterling Suraj Srinivas Flavio du Pin Calmon Himabindu Lakkaraju 73 38 0 16 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 77 103 0 11 Jan 2024
Steering Llama 2 via Contrastive Activation Addition Nina Rimsky Nick Gabrieli Julian Schulz Meg Tong Evan Hubinger Alexander Matt Turner LLMSV 43 188 0 09 Dec 2023
The Linear Representation Hypothesis and the Geometry of Large Language Models Kiho Park Yo Joong Choe Victor Veitch LLMSV MILM 82 162 0 07 Nov 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets Samuel Marks Max Tegmark HILM 112 199 0 10 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 137 105 0 27 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models Hoagy Cunningham Aidan Ewart Logan Riggs R. Huben Lee Sharkey MILM 70 382 0 15 Sep 2023
Linearity of Relation Decoding in Transformer Language Models Evan Hernandez Arnab Sen Sharma Tal Haklay Kevin Meng Martin Wattenberg Jacob Andreas Yonatan Belinkov David Bau KELM 41 95 0 17 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models Hugo Touvron Louis Martin Kevin R. Stone Peter Albert Amjad Almahairi ... Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov Thomas Scialom AI4MH ALM 213 11,636 0 18 Jul 2023
Inspecting and Editing Knowledge Representations in Language Models Evan Hernandez Belinda Z. Li Jacob Andreas KELM 44 84 0 03 Apr 2023
Jump to Conclusions: Short-Cutting Transformers With Linear Transformations Alexander Yom Din Taelin Karidi Leshem Choshen Mor Geva 27 61 0 16 Mar 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens Nora Belrose Zach Furman Logan Smith Danny Halawi Igor V. Ostrovsky Lev McKinney Stella Biderman Jacob Steinhardt 38 213 0 14 Mar 2023
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models Peter Hase Joey Tianyi Zhou Been Kim Asma Ghandeharioun MILM 84 179 0 10 Jan 2023
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task Kenneth Li Aspen K. Hopkins David Bau Fernanda Viégas Hanspeter Pfister Martin Wattenberg MILM 65 280 0 24 Oct 2022
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space Mor Geva Avi Caciularu Ke Wang Yoav Goldberg KELM 85 358 0 28 Mar 2022
Causal Abstractions of Neural Networks Atticus Geiger Hanson Lu Thomas Icard Christopher Potts NAI CML 52 234 0 06 Jun 2021
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 236 427 0 24 Feb 2021
Analysis Methods in Neural Language Processing: A Survey Yonatan Belinkov James R. Glass 64 555 0 21 Dec 2018
Understanding intermediate layers using linear classifier probes Guillaume Alain Yoshua Bengio FAtt 100 923 0 05 Oct 2016