Toy Models of Superposition

21 September 2022
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
Shauna Kravec
Zac Hatfield-Dodds
R. Lasenby
Dawn Drain
Carol Chen
Roger C. Grosse
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
    AAML
    MILM

Papers citing "Toy Models of Superposition"

50 / 75 papers shown
Embedding Atlas: Low-Friction, Interactive Embedding Visualization
Donghao Ren
Fred Hohman
Halden Lin
Dominik Moritz
23
0
0
09 May 2025
Sparsity is All You Need: Rethinking Biological Pathway-Informed Approaches in Deep Learning
Isabella Caranzano
Corrado Pancotti
Cesare Rollo
Flavio Sartori
Pietro Liò
P. Fariselli
Tiziana Sanavia
OOD
UQCV
60
0
0
07 May 2025
Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs
Chetan Pathade
AAML
SILM
59
0
0
07 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He
J. Wang
Rui Lin
Xuyang Ge
Wentao Shu
Qiong Tang
J. Zhang
Xipeng Qiu
70
0
0
29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Sonia Joseph
Praneet Suresh
Lorenz Hufe
Edward Stevinson
Robert Graham
Yash Vadi
Danilo Bzdok
Sebastian Lapuschkin
Lee Sharkey
Blake A. Richards
72
0
0
28 Apr 2025
Representation Learning on a Random Lattice
Aryeh Brill
OOD
FAtt
AI4CE
73
0
0
28 Apr 2025
Naturally Computed Scale Invariance in the Residual Stream of ResNet18
André Longon
63
0
0
22 Apr 2025
Interpreting the Linear Structure of Vision-language Model Embedding Spaces
Isabel Papadimitriou
Huangyuan Su
Thomas Fel
Naomi Saphra
Sham Kakade
Stephanie Gil
VLM
50
0
0
16 Apr 2025
Towards Combinatorial Interpretability of Neural Computation
Micah Adler
Dan Alistarh
Nir Shavit
FAtt
110
1
0
10 Apr 2025
Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
Ziwei Ji
L. Yu
Yeskendir Koishekenov
Yejin Bang
Anthony Hartshorn
Alan Schelten
Cheng Zhang
Pascale Fung
Nicola Cancedda
49
1
0
18 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu
Dong Gong
Erdun Gao
Zhen Zhang
Biwei Huang
Mingming Gong
Anton van den Hengel
Javen Qinfeng Shi
154
0
0
12 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
65
0
0
08 Mar 2025
Strategy Coopetition Explains the Emergence and Transience of In-Context Learning
Aaditya K. Singh
Ted Moskovitz
Sara Dragutinovic
Felix Hill
Stephanie C. Y. Chan
Andrew Saxe
139
0
0
07 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Jonathan Jacobi
Gal Niv
LRM
ReLM
60
0
0
03 Mar 2025
Linear Representations of Political Perspective Emerge in Large Language Models
Junsol Kim
James Evans
Aaron Schein
77
2
0
03 Mar 2025
Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models
Ruta Binkyte
Ivaxi Sheth
Zhijing Jin
Mohammad Havaei
Bernhard Schölkopf
Mario Fritz
128
0
0
28 Feb 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis
Ge Lei
Samuel J. Cooper
KELM
47
0
0
15 Feb 2025
Superpose Singular Features for Model Merging
Haiquan Qiu
You Wu
Quanming Yao
MoMe
45
0
0
15 Feb 2025
The Complexity of Learning Sparse Superposed Features with Feedback
Akash Kumar
152
0
0
08 Feb 2025
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Harrish Thasarathan
Julian Forsyth
Thomas Fel
M. Kowal
Konstantinos G. Derpanis
111
7
0
06 Feb 2025
Out-of-distribution generalization via composition: a lens through induction heads in Transformers
Jiajun Song
Zhuoyan Xu
Yiqiao Zhong
85
4
0
31 Dec 2024
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
Yang Xu
Y. Wang
Hao Wang
108
1
0
23 Dec 2024
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
Konstantin Donhauser
Kristina Ulicna
Gemma Elyse Moran
Aditya Ravuri
Kian Kenyon-Dean
Cian Eastwood
Jason Hartford
76
0
0
20 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks
Alex F Spies
William Edwards
Michael I. Ivanitskiy
Adrians Skapars
Tilman Rauker
Katsumi Inoue
A. Russo
Murray Shanahan
122
1
0
16 Dec 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Charles O'Neill
David Klindt
93
1
0
20 Nov 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
Emanuele Marconato
Sébastien Lachapelle
Sebastian Weichwald
Luigi Gresele
69
3
0
30 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
57
9
0
18 Oct 2024
More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing
Sagi Shaier
Francisco Pereira
K. Wense
Lawrence E Hunter
Matt Jones
MoE
46
0
0
10 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
52
7
0
10 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Philipp Mondorf
Sondre Wold
Barbara Plank
34
0
0
02 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
62
10
0
30 Sep 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
59
2
0
30 Sep 2024
DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction
John Wu
David Wu
Jimeng Sun
101
0
0
16 Sep 2024
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
26
3
0
06 Sep 2024
On the Complexity of Neural Computation in Superposition
Micah Adler
Nir Shavit
115
3
0
05 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
82
19
0
02 Jul 2024
On Implications of Scaling Laws on Feature Superposition
Pavan Katta
23
0
0
01 Jul 2024
Memorizing Documents with Guidance in Large Language Models
Bumjin Park
Jaesik Choi
KELM
RALM
36
1
0
23 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
58
13
0
13 Jun 2024
Interpreting the Second-Order Effects of Neurons in CLIP
Yossi Gandelsman
Alexei A. Efros
Jacob Steinhardt
MILM
56
16
0
06 Jun 2024
Feature contamination: Neural networks learn uncorrelated features and fail to generalize
Tianren Zhang
Chujie Zhao
Guanyu Chen
Yizhou Jiang
Feng Chen
OOD
MLT
OODD
77
3
0
05 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park
Yo Joong Choe
Yibo Jiang
Victor Veitch
50
25
0
03 Jun 2024
Standards for Belief Representations in LLMs
Daniel A. Herrmann
B. Levinstein
39
6
0
31 May 2024
Knowledge Circuits in Pretrained Transformers
Yunzhi Yao
Ningyu Zhang
Zekun Xi
Meng Wang
Ziwen Xu
Shumin Deng
Huajun Chen
KELM
64
20
0
28 May 2024
Linguistic Collapse: Neural Collapse in (Large) Language Models
Robert Wu
V. Papyan
48
12
0
28 May 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
61
7
0
26 May 2024
Securing the Future of GenAI: Policy and Technology
Mihai Christodorescu
Craven
S. Feizi
Neil Zhenqiang Gong
Mia Hoffmann
...
Jessica Newman
Emelia Probasco
Yanjun Qi
Khawaja Shams
Turek
SILM
46
3
0
21 May 2024
When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang
Haoyu Bu
Hui Wen
Yu Chen
Lun Li
Hongsong Zhu
33
36
0
06 May 2024
KAN: Kolmogorov-Arnold Networks
Ziming Liu
Yixuan Wang
Sachin Vaidya
Fabian Ruehle
James Halverson
Marin Soljacic
Thomas Y. Hou
Max Tegmark
80
473
0
30 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
46
111
0
28 Mar 2024