SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
arXiv 2410.07456 · 9 October 2024
Constantin Venhoff, Anisoara Calinescu, Philip Torr, Christian Schroeder de Witt

Papers citing "SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders"

13 citing papers:

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, C. M. Verdun, David Bau, Samuel Marks
31 Jul 2024 · 31 citations

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
28 Mar 2024 · 158 citations

Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy
16 Oct 2023 · 67 citations

Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O'Keefe, ..., Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert F. Trager, Kevin J. Wolf
06 Jul 2023 · 125 citations

Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
28 Apr 2023 · 319 citations

Interpreting Neural Networks through the Polytope Lens
Sid Black, Lee D. Sharkey, Léo Grinsztajn, Eric Winsor, Daniel A. Braun, ..., Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy
22 Nov 2022 · 26 citations

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022 · 563 citations

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
24 Sep 2022 · 528 citations

X-Risk Analysis for AI Research
Dan Hendrycks, Mantas Mazeika
13 Jun 2022 · 71 citations

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
28 May 2020 · 42,463 citations

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
23 Oct 2019 · 20,342 citations

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
11 Oct 2018 · 95,229 citations

Axiomatic Attribution for Deep Networks
Mukund Sundararajan, Ankur Taly, Qiqi Yan
04 Mar 2017 · 6,024 citations