Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

18 February 2025
Thomas Fel
Ekdeep Singh Lubana
Jacob S. Prince
Matthew Kowal
Victor Boutin
Isabel Papadimitriou
Binxu Wang
Martin Wattenberg
Demba Ba
Talia Konkle
Abstract

Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAEs), wherein dictionary atoms are constrained to the convex hull of the data. This geometric anchoring significantly enhances the stability of the inferred dictionaries, and their mildly relaxed variants, RA-SAEs, further match state-of-the-art reconstruction performance. To rigorously assess the quality of dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, whether dictionaries recover "true" classification directions, and (ii) identifiability, whether dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
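
The central construction, constraining each dictionary atom to the convex hull of the data, is straightforward to sketch. The snippet below is a minimal PyTorch illustration, not the authors' implementation: the class name ArchetypalDictionary, the fixed anchor set, and the norm-bounded offset used to loosely mimic the relaxed (RA-SAE-style) variant are all assumptions made for exposition.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ArchetypalDictionary(nn.Module):
        # Hypothetical sketch: each atom is a convex combination of fixed
        # data anchors C (n_anchors x d), so every atom lies in conv(C).
        # A small norm-bounded offset loosely mimics the relaxed variant.
        def __init__(self, anchors: torch.Tensor, n_atoms: int, relax: float = 0.0):
            super().__init__()
            self.register_buffer("anchors", anchors)          # (n_anchors, d)
            self.logits = nn.Parameter(torch.randn(n_atoms, anchors.shape[0]))
            self.offset = nn.Parameter(torch.zeros(n_atoms, anchors.shape[1]))
            self.relax = relax

        def forward(self) -> torch.Tensor:
            weights = F.softmax(self.logits, dim=-1)          # rows on the simplex
            atoms = weights @ self.anchors                    # inside conv(anchors)
            if self.relax > 0:
                # Rescale the offset so its norm never exceeds `relax`,
                # keeping atoms close to the convex hull.
                norms = self.offset.norm(dim=-1, keepdim=True).clamp(min=self.relax)
                atoms = atoms + self.offset * (self.relax / norms)
            return atoms

    # Usage sketch: treat the atoms as the decoder weights of an SAE,
    # reconstructing activations as x_hat = codes @ dictionary().
    anchors = torch.randn(1024, 768)                          # stand-in data anchors
    dictionary = ArchetypalDictionary(anchors, n_atoms=512, relax=0.1)
    print(dictionary().shape)                                 # torch.Size([512, 768])

Because the combination weights are produced by a softmax, they are nonnegative and sum to one by construction, which is what pins each atom to the convex hull and gives the stability the abstract describes.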

@article{fel2025_2502.12892,
  title={Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models},
  author={Thomas Fel and Ekdeep Singh Lubana and Jacob S. Prince and Matthew Kowal and Victor Boutin and Isabel Papadimitriou and Binxu Wang and Martin Wattenberg and Demba Ba and Talia Konkle},
  journal={arXiv preprint arXiv:2502.12892},
  year={2025}
}