Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2411.04430
Cited By
Towards Unifying Interpretability and Control: Evaluation via Intervention
7 November 2024
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Towards Unifying Interpretability and Control: Evaluation via Intervention"
35 / 35 papers shown
Title
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Aaron Jiaxun Li
Suraj Srinivas
Usha Bhalla
Himabindu Lakkaraju
AAML
92
0
0
21 May 2025
Steering off Course: Reliability Challenges in Steering Language Models
Patrick Queiroz Da Silva
Hari Sethuraman
Dheeraj Rajagopal
Hannaneh Hajishirzi
Sachin Kumar
LLMSV
63
1
0
06 Apr 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
Sai Sumedh R. Hindupur
Ekdeep Singh Lubana
Thomas Fel
Demba Ba
70
6
0
03 Mar 2025
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
Thomas Fel
Ekdeep Singh Lubana
Jacob S. Prince
M. Kowal
Victor Boutin
Isabel Papadimitriou
Binxu Wang
Martin Wattenberg
Demba Ba
Talia Konkle
51
3
0
18 Feb 2025
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Harrish Thasarathan
Julian Forsyth
Thomas Fel
M. Kowal
Konstantinos G. Derpanis
122
9
0
06 Feb 2025
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
64
106
0
09 Aug 2024
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team
Gemma Team Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
...
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
VLM
MoE
OSLM
84
772
0
31 Jul 2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Adam Karvonen
Benjamin Wright
Can Rager
Rico Angell
Jannik Brinkmann
Logan Smith
C. M. Verdun
David Bau
Samuel Marks
48
30
0
31 Jul 2024
Relational Composition in Neural Networks: A Survey and Call to Action
Martin Wattenberg
Fernanda Viégas
CoGe
65
9
0
19 Jul 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Senthooran Rajamanoharan
Tom Lieberum
Nicolas Sonnerat
Arthur Conmy
Vikrant Varma
János Kramár
Neel Nanda
43
87
0
19 Jul 2024
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun
Ann Yuan
Marius Guerard
Emily Reif
Michael A. Lepori
Lucas Dixon
LLMSV
64
8
0
17 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits
Jacob Dunefsky
Philippe Chlenski
Neel Nanda
56
30
0
17 Jun 2024
Designing a Dashboard for Transparency and Control of Conversational AI
Yida Chen
Aoyu Wu
Trevor DePodesta
Catherine Yeh
Kenneth Li
...
Jan Riecke
Shivam Raval
Olivia Seow
Martin Wattenberg
Fernanda Viégas
75
17
0
12 Jun 2024
Scaling and evaluating sparse autoencoders
Leo Gao
Tom Dupré la Tour
Henk Tillman
Gabriel Goh
Rajan Troll
Alec Radford
Ilya Sutskever
Jan Leike
Jeffrey Wu
62
134
0
06 Jun 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov
Georg Lange
Neel Nanda
40
38
0
14 May 2024
Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Tom Lieberum
Vikrant Varma
János Kramár
Rohin Shah
Neel Nanda
RALM
47
86
0
24 Apr 2024
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Usha Bhalla
Alexander X. Oesterling
Suraj Srinivas
Flavio du Pin Calmon
Himabindu Lakkaraju
73
38
0
16 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun
Avi Caciularu
Adam Pearce
Lucas Dixon
Mor Geva
77
103
0
11 Jan 2024
Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
LLMSV
43
188
0
09 Dec 2023
The Linear Representation Hypothesis and the Geometry of Large Language Models
Kiho Park
Yo Joong Choe
Victor Veitch
LLMSV
MILM
82
162
0
07 Nov 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
112
199
0
10 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
137
105
0
27 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
70
382
0
15 Sep 2023
Linearity of Relation Decoding in Transformer Language Models
Evan Hernandez
Arnab Sen Sharma
Tal Haklay
Kevin Meng
Martin Wattenberg
Jacob Andreas
Yonatan Belinkov
David Bau
KELM
41
95
0
17 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
213
11,636
0
18 Jul 2023
Inspecting and Editing Knowledge Representations in Language Models
Evan Hernandez
Belinda Z. Li
Jacob Andreas
KELM
44
84
0
03 Apr 2023
Jump to Conclusions: Short-Cutting Transformers With Linear Transformations
Alexander Yom Din
Taelin Karidi
Leshem Choshen
Mor Geva
27
61
0
16 Mar 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose
Zach Furman
Logan Smith
Danny Halawi
Igor V. Ostrovsky
Lev McKinney
Stella Biderman
Jacob Steinhardt
38
213
0
14 Mar 2023
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
Peter Hase
Joey Tianyi Zhou
Been Kim
Asma Ghandeharioun
MILM
84
179
0
10 Jan 2023
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
Kenneth Li
Aspen K. Hopkins
David Bau
Fernanda Viégas
Hanspeter Pfister
Martin Wattenberg
MILM
65
280
0
24 Oct 2022
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
Mor Geva
Avi Caciularu
Ke Wang
Yoav Goldberg
KELM
85
358
0
28 Mar 2022
Causal Abstractions of Neural Networks
Atticus Geiger
Hanson Lu
Thomas Icard
Christopher Potts
NAI
CML
52
234
0
06 Jun 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
236
427
0
24 Feb 2021
Analysis Methods in Neural Language Processing: A Survey
Yonatan Belinkov
James R. Glass
64
555
0
21 Dec 2018
Understanding intermediate layers using linear classifier probes
Guillaume Alain
Yoshua Bengio
FAtt
100
923
0
05 Oct 2016
1