Linear attention is (maybe) all you need (to understand transformer optimization)
arXiv:2310.01082 · 2 October 2023
Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, S. Sra
Papers citing "Linear attention is (maybe) all you need (to understand transformer optimization)" (16 papers)
Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization · Francois Chaubard, Mykel J. Kochenderfer · 23 May 2025 · [MQ, AI4CE] · 0 citations
Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking) · Yoonsoo Nam, Seok Hyeong Lee, Clementine Domine, Yea Chan Park, Charles London, Wonyl Choi, Niclas Goring, Seungjai Lee · 28 Feb 2025 · [AI4CE] · 1 citation
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis · Weronika Ormaniec, Felix Dangel, Sidak Pal Singh · 14 Oct 2024 · 7 citations
Spin glass model of in-context learning · Yuhao Li, Ruoran Bai, Haiping Huang · 05 Aug 2024 · [LRM] · 0 citations
Deconstructing What Makes a Good Optimizer for Language Models · Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade · 10 Jul 2024 · 24 citations
Does SGD really happen in tiny subspaces? · Minhak Song, Kwangjun Ahn, Chulhee Yun · 25 May 2024 · 6 citations
On the Role of Attention in Prompt-tuning · Samet Oymak, A. S. Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis · 06 Jun 2023 · [MLT, LRM] · 46 citations
The Crucial Role of Normalization in Sharpness-Aware Minimization · Yan Dai, Kwangjun Ahn, S. Sra · 24 May 2023 · 19 citations
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity · Hongkang Li, Ming Wang, Sijia Liu, Pin-Yu Chen · 12 Feb 2023 · [ViT, MLT] · 64 citations
Transformers learn in-context by gradient descent · J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov · 15 Dec 2022 · [MLT] · 494 citations
Learning threshold neurons via the "edge of stability" · Kwangjun Ahn, Sébastien Bubeck, Sinho Chewi, Y. Lee, Felipe Suarez, Yi Zhang · 14 Dec 2022 · [MLT] · 41 citations
How Does Adaptive Optimization Impact Local Neural Network Geometry? · Kaiqi Jiang, Dhruv Malik, Yuanzhi Li · 04 Nov 2022 · 19 citations
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes · Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant · 01 Aug 2022 · 513 citations
Scaling Laws for Neural Language Models · Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei · 23 Jan 2020 · 4,893 citations
Why gradient clipping accelerates training: A theoretical justification for adaptivity · J.N. Zhang, Tianxing He, S. Sra, Ali Jadbabaie · 28 May 2019 · 467 citations
The Marginal Value of Adaptive Gradient Methods in Machine Learning · Ashia Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht · 23 May 2017 · [ODL] · 1,032 citations