Linear attention is (maybe) all you need (to understand transformer optimization)

2 October 2023
Kwangjun Ahn
Xiang Cheng
Minhak Song
Chulhee Yun
Ali Jadbabaie
S. Sra

Papers citing "Linear attention is (maybe) all you need (to understand transformer optimization)"

16 / 16 papers shown
Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization
Francois Chaubard
Mykel J. Kochenderfer
MQ, AI4CE
182
0
0
23 May 2025
Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)
Yoonsoo Nam
Seok Hyeong Lee
Clementine Domine
Yea Chan Park
Charles London
Wonyl Choi
Niclas Goring
Seungjai Lee
AI4CE
181
1
0
28 Feb 2025
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec
Felix Dangel
Sidak Pal Singh
113
7
0
14 Oct 2024
Spin glass model of in-context learning
Yuhao Li
Ruoran Bai
Haiping Huang
LRM
120
0
0
05 Aug 2024
Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao
Depen Morwani
David Brandfonbrener
Nikhil Vyas
Sham Kakade
101
24
0
10 Jul 2024
Does SGD really happen in tiny subspaces?
Minhak Song
Kwangjun Ahn
Chulhee Yun
101
6
1
25 May 2024
On the Role of Attention in Prompt-tuning
Samet Oymak
A. S. Rawat
Mahdi Soltanolkotabi
Christos Thrampoulidis
MLT, LRM
59
46
0
06 Jun 2023
The Crucial Role of Normalization in Sharpness-Aware Minimization
Yan Dai
Kwangjun Ahn
S. Sra
102
19
0
24 May 2023
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
Hongkang Li
Ming Wang
Sijia Liu
Pin-Yu Chen
ViT, MLT
105
64
0
12 Feb 2023
Transformers learn in-context by gradient descent
J. Oswald
Eyvind Niklasson
E. Randazzo
João Sacramento
A. Mordvintsev
A. Zhmoginov
Max Vladymyrov
MLT
116
494
0
15 Dec 2022
Learning threshold neurons via the "edge of stability"
Kwangjun Ahn
Sébastien Bubeck
Sinho Chewi
Y. Lee
Felipe Suarez
Yi Zhang
MLT
82
41
0
14 Dec 2022
How Does Adaptive Optimization Impact Local Neural Network Geometry?
Kaiqi Jiang
Dhruv Malik
Yuanzhi Li
104
19
0
04 Nov 2022
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
Shivam Garg
Dimitris Tsipras
Percy Liang
Gregory Valiant
141
513
0
01 Aug 2022
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
608
4,893
0
23 Jan 2020
Why gradient clipping accelerates training: A theoretical justification for adaptivity
J.N. Zhang
Tianxing He
S. Sra
Ali Jadbabaie
76
467
0
28 May 2019
The Marginal Value of Adaptive Gradient Methods in Machine Learning
Ashia Wilson
Rebecca Roelofs
Mitchell Stern
Nathan Srebro
Benjamin Recht
ODL
71
1,032
0
23 May 2017