Toy Models of Superposition (arXiv:2209.10652)
21 September 2022
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, Shauna Kravec, Zac Hatfield-Dodds, R. Lasenby, Dawn Drain, Carol Chen, Roger C. Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
AAML · MILM
Papers citing "Toy Models of Superposition" (32 / 82 papers shown)
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma
LLMSV · 71 · 7 · 0 · 26 May 2024
Securing the Future of GenAI: Policy and Technology
Mihai Christodorescu, Craven, S. Feizi, Neil Zhenqiang Gong, Mia Hoffmann, ..., Jessica Newman, Emelia Probasco, Yanjun Qi, Khawaja Shams, Turek
SILM · 52 · 3 · 0 · 21 May 2024
When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, Hongsong Zhu
45 · 36 · 0 · 06 May 2024
KAN: Kolmogorov-Arnold Networks
Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljacic, Thomas Y. Hou, Max Tegmark
98 · 475 · 0 · 30 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
46 · 115 · 0 · 28 Mar 2024
Language Models Represent Beliefs of Self and Others
Wentao Zhu, Zhining Zhang, Yizhou Wang
MILM · LRM · 50 · 8 · 0 · 28 Feb 2024
Carrying over algorithm in transformers
J. Kruthoff
24 · 0 · 0 · 15 Jan 2024
In-Context Reinforcement Learning for Variable Action Spaces
Viacheslav Sinii, Alexander Nikulin, Vladislav Kurenkov, Ilya Zisman, Sergey Kolesnikov
24 · 14 · 0 · 20 Dec 2023
Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Tony T. Wang, Miles Wang, Kaivu Hariharan, Nir Shavit
21 · 2 · 0 · 14 Dec 2023
FlexModel: A Framework for Interpretability of Distributed Large Language Models
Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson
AI4CE · ALM · 27 · 1 · 0 · 05 Dec 2023
Identifying Linear Relational Concepts in Large Language Models
David Chanin, Anthony Hunter, Oana-Maria Camburu
LLMSV · KELM · 23 · 4 · 0 · 15 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori, Thomas Serre, Ellie Pavlick
75 · 7 · 0 · 07 Nov 2023
Identifying Interpretable Visual Features in Artificial and Biological Neural Systems
David A. Klindt, Sophia Sanborn, Francisco Acosta, Frédéric Poitevin, Nina Miolane
MILM · FAtt · 44 · 7 · 0 · 17 Oct 2023
Language Models Represent Space and Time
Wes Gurnee, Max Tegmark
47 · 142 · 0 · 03 Oct 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey
MILM · 33 · 335 · 0 · 15 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang
34 · 20 · 0 · 27 Aug 2023
Identifying Interpretable Subspaces in Image Representations
Neha Kalibhat, S. Bhardwaj, Bayan Bruss, Hamed Firooz, Maziar Sanjabi, S. Feizi
FAtt · 42 · 26 · 0 · 20 Jul 2023
Uncovering Unique Concept Vectors through Latent Space Decomposition
Mara Graziani, Laura Mahony, An-phi Nguyen, Henning Muller, Vincent Andrearczyk
43 · 4 · 0 · 13 Jul 2023
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
Ziming Liu, Eric Gan, Max Tegmark
26 · 36 · 0 · 04 May 2023
Redundancy and Concept Analysis for Code-trained Language Models
Arushi Sharma, Zefu Hu, Christopher Quinn, Ali Jannesari
73 · 1 · 0 · 01 May 2023
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
MILM · 28 · 2 · 0 · 22 Apr 2023
Visual DNA: Representing and Comparing Images using Distributions of Neuron Activations
Benjamin Ramtoula, Matthew Gadd, Paul Newman, D. Martini
28 · 10 · 0 · 20 Apr 2023
20 Apr 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor V. Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt
22 · 193 · 0 · 14 Mar 2023
Tracr: Compiled Transformers as a Laboratory for Interpretability
David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Tom McGrath, Vladimir Mikulik
29 · 72 · 0 · 12 Jan 2023
Circumventing interpretability: How to defeat mind-readers
Lee D. Sharkey
35 · 3 · 0 · 21 Dec 2022
Schrödinger's Bat: Diffusion Models Sometimes Generate Polysemous Words in Superposition
Jennifer C. White, Ryan Cotterell
DiffM · 38 · 5 · 0 · 23 Nov 2022
Interpreting Neural Networks through the Polytope Lens
Sid Black, Lee D. Sharkey, Léo Grinsztajn, Eric Winsor, Daniel A. Braun, ..., Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy
FAtt · MILM · 31 · 22 · 0 · 22 Nov 2022
CRAFT: Concept Recursive Activation FacTorization for Explainability
Thomas Fel, Agustin Picard, Louis Bethune, Thibaut Boissin, David Vigouroux, Julien Colin, Rémi Cadène, Thomas Serre
19 · 102 · 0 · 17 Nov 2022
Engineering Monosemanticity in Toy Models
Adam Jermyn, Nicholas Schiefer, Evan Hubinger
MILM · 25 · 9 · 0 · 16 Nov 2022
Polysemanticity and Capacity in Neural Networks
Adam Scherlis, Kshitij Sachan, Adam Jermyn, Joe Benton, Buck Shlegeris
MILM · 135 · 25 · 0 · 04 Oct 2022
Measuring Self-Supervised Representation Quality for Downstream Classification using Discriminative Features
Neha Kalibhat, Kanika Narang, Hamed Firooz, Maziar Sanjabi, S. Feizi
SSL · 38 · 7 · 0 · 03 Mar 2022
Navigating Neural Space: Revisiting Concept Activation Vectors to Overcome Directional Divergence
Frederik Pahde, Maximilian Dreyer, Leander Weber, Moritz Weckbecker, Christopher J. Anders, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
60 · 7 · 0 · 07 Feb 2022