Linear attention is (maybe) all you need (to understand transformer optimization)
arXiv:2310.01082 · 2 October 2023
Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, S. Sra
Papers citing "Linear attention is (maybe) all you need (to understand transformer optimization)" (16 papers)
Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization · Francois Chaubard, Mykel J. Kochenderfer · 23 May 2025 · [MQ, AI4CE] · 0 citations
Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking) · Yoonsoo Nam, Seok Hyeong Lee, Clementine Domine, Yea Chan Park, Charles London, Wonyl Choi, Niclas Goring, Seungjai Lee · 28 Feb 2025 · [AI4CE] · 1 citation
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis · Weronika Ormaniec, Felix Dangel, Sidak Pal Singh · 14 Oct 2024 · 7 citations
Spin glass model of in-context learning · Yuhao Li, Ruoran Bai, Haiping Huang · 05 Aug 2024 · [LRM] · 0 citations
Deconstructing What Makes a Good Optimizer for Language Models · Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade · 10 Jul 2024 · 24 citations
Does SGD really happen in tiny subspaces? · Minhak Song, Kwangjun Ahn, Chulhee Yun · 25 May 2024 · 6 citations
On the Role of Attention in Prompt-tuning · Samet Oymak, A. S. Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis · 06 Jun 2023 · [MLT, LRM] · 46 citations
The Crucial Role of Normalization in Sharpness-Aware Minimization · Yan Dai, Kwangjun Ahn, S. Sra · 24 May 2023 · 19 citations
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity · Hongkang Li, Ming Wang, Sijia Liu, Pin-Yu Chen · 12 Feb 2023 · [ViT, MLT] · 64 citations
Transformers learn in-context by gradient descent · J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov · 15 Dec 2022 · [MLT] · 494 citations
Learning threshold neurons via the "edge of stability" · Kwangjun Ahn, Sébastien Bubeck, Sinho Chewi, Y. Lee, Felipe Suarez, Yi Zhang · 14 Dec 2022 · [MLT] · 41 citations
How Does Adaptive Optimization Impact Local Neural Network Geometry? · Kaiqi Jiang, Dhruv Malik, Yuanzhi Li · 04 Nov 2022 · 19 citations
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes · Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant · 01 Aug 2022 · 513 citations
Scaling Laws for Neural Language Models · Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei · 23 Jan 2020 · 4,893 citations
Why gradient clipping accelerates training: A theoretical justification for adaptivity · J.N. Zhang, Tianxing He, S. Sra, Ali Jadbabaie · 28 May 2019 · 467 citations
The Marginal Value of Adaptive Gradient Methods in Machine Learning · Ashia Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht · 23 May 2017 · [ODL] · 1,032 citations