Understanding Transformer from the Perspective of Associative Memory
Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi
arXiv:2505.19488 · 26 May 2025

Papers citing "Understanding Transformer from the Perspective of Associative Memory" (35 / 35 papers shown)

It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
82 · 3 · 0 · 17 Apr 2025

MoBA: Mixture of Block Attention for Long-Context LLMs
Enzhe Lu, Z. L. Jiang, Qingbin Liu, Yulun Du, Tao Jiang, ..., N. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, J. Qiu
80 · 24 · 0 · 18 Feb 2025

MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Bo Shen, ..., Zhan Qin, Zhenhua Fan, Zhihang Yu, Z. L. Jiang, Zijia Wu
[MoE] 125 · 38 · 0 · 14 Jan 2025

Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
Riccardo Grazzi, Julien N. Siems, Jörg Franke, Arber Zela, Frank Hutter, Massimiliano Pontil
159 · 23 · 0 · 19 Nov 2024

Chain and Causal Attention for Efficient Entity Tracking
Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen
58 · 4 · 0 · 07 Oct 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao
125 · 146 · 0 · 11 Jul 2024

Understanding Transformer Reasoning Capabilities via Graph Algorithms
Clayton Sanford, Bahare Fatemi, Ethan Hall, Anton Tsitsulin, Seyed Mehran Kazemi, Jonathan J. Halcrow, Bryan Perozzi, Vahab Mirrokni
73 · 35 · 0 · 28 May 2024

Length Generalization of Causal Transformers without Position Encoding
Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang
[VLM] 68 · 21 · 0 · 18 Apr 2024

The Illusion of State in State-Space Models
William Merrill, Jackson Petty, Ashish Sabharwal
91 · 55 · 0 · 12 Apr 2024

Mechanistic Design and Scaling of Hybrid Architectures
Michael Poli, Armin W. Thomas, Eric N. D. Nguyen, Pragaash Ponnusamy, Björn Deiseroth, ..., Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli
[MoE] 98 · 27 · 0 · 26 Mar 2024

Gated Linear Attention Transformers with Hardware-Efficient Training
Aaron Courville, Bailin Wang, Songlin Yang, Yikang Shen, Yoon Kim
102 · 172 · 0 · 11 Dec 2023

YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole
[OSLM] 67 · 255 · 0 · 31 Aug 2023

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao
[LRM] 110 · 1,281 · 0 · 17 Jul 2023

Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei
[LRM] 129 · 332 · 0 · 17 Jul 2023

The Impact of Positional Encoding on Length Generalization in Transformers
Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan N. Ramamurthy, Payel Das, Siva Reddy
70 · 198 · 0 · 31 May 2023

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
76 · 664 · 0 · 22 May 2023

Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, C. McLeavey, Ilya Sutskever
[OffRL] 183 · 3,655 · 0 · 06 Dec 2022

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
316 · 514 · 0 · 24 Sep 2022

Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
[AAML, MILM] 181 · 366 · 0 · 21 Sep 2022

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant
139 · 505 · 0 · 01 Aug 2022

The Parallelism Tradeoff: Limitations of Log-Precision Transformers
William Merrill, Ashish Sabharwal
73 · 112 · 0 · 02 Jul 2022

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
[VLM] 224 · 2,214 · 0 · 27 May 2022

NormFormer: Improved Transformer Pretraining with Extra Normalization
Sam Shleifer, Jason Weston, Myle Ott
[AI4CE] 48 · 76 · 0 · 18 Oct 2021

RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
275 · 2,453 · 0 · 20 Apr 2021

Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya A. Ramesh, Gabriel Goh, ..., Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
[CLIP, VLM] 923 · 29,372 · 0 · 26 Feb 2021

Linear Transformers Are Secretly Fast Weight Programmers
Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber
117 · 247 · 0 · 22 Feb 2021

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
W. Fedus, Barret Zoph, Noam M. Shazeer
[MoE] 85 · 2,181 · 0 · 11 Jan 2021

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, ..., Matthias Minderer, G. Heigold, Sylvain Gelly, Jakob Uszkoreit, N. Houlsby
[ViT] 645 · 41,003 · 0 · 22 Oct 2020

Hopfield Networks is All You Need
Hubert Ramsauer, Bernhard Schafl, Johannes Lehner, Philipp Seidl, Michael Widrich, ..., David P. Kreil, Michael K. Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter
100 · 433 · 0 · 16 Jul 2020

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
201 · 1,765 · 0 · 29 Jun 2020

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
[BDL] 755 · 41,932 · 0 · 28 May 2020

GLU Variants Improve Transformer
Noam M. Shazeer
126 · 996 · 0 · 12 Feb 2020

Root Mean Square Layer Normalization
Biao Zhang, Rico Sennrich
86 · 733 · 0 · 16 Oct 2019

Attention Is All You Need
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin
[3DV] 698 · 131,526 · 0 · 12 Jun 2017

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam M. Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, J. Dean
[MoE] 248 · 2,644 · 0 · 23 Jan 2017