Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2502.00213
Cited By
Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
31 January 2025
Akiyoshi Tomihari
Issei Sato
ODL
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers"
50 / 52 papers shown
Title
Understanding Adam Requires Better Rotation Dependent Assumptions
Lucas Maes
Tianyue H. Zhang
Alexia Jolicoeur-Martineau
Ioannis Mitliagkas
Damien Scieur
Simon Lacoste-Julien
Charles Guille-Escuret
44
3
0
25 Oct 2024
What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec
Felix Dangel
Sidak Pal Singh
99
7
0
14 Oct 2024
Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao
Depen Morwani
David Brandfonbrener
Nikhil Vyas
Sham Kakade
81
22
0
10 Jul 2024
Adam-mini: Use Fewer Learning Rates To Gain More
Yushun Zhang
Congliang Chen
Ziniu Li
Tian Ding
Chenwei Wu
Yinyu Ye
Zhi-Quan Luo
Ruoyu Sun
79
52
0
24 Jun 2024
On the Role of Attention Masks and LayerNorm in Transformers
Xinyi Wu
A. Ajorlou
Yifei Wang
Stefanie Jegelka
Ali Jadbabaie
78
12
0
29 May 2024
Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective
Akiyoshi Tomihari
Issei Sato
56
4
0
27 May 2024
Implicit Bias of AdamW:
ℓ
∞
\ell_\infty
ℓ
∞
Norm Constrained Optimization
Shuo Xie
Zhiyuan Li
OffRL
66
20
0
05 Apr 2024
Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models
Wanyun Cui
Qianle Wang
MQ
62
3
0
03 Apr 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner
Robin Yadav
Alan Milligan
Mark Schmidt
Alberto Bietti
66
32
0
29 Feb 2024
Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang
Congliang Chen
Tian Ding
Ziniu Li
Ruoyu Sun
Zhimin Luo
77
53
0
26 Feb 2024
Self-attention Networks Localize When QK-eigenspectrum Concentrates
Han Bao
Ryuichiro Hataya
Ryo Karakida
35
5
0
03 Feb 2024
Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization
Elan Rosenfeld
Andrej Risteski
47
11
0
07 Nov 2023
Mistral 7B
Albert Q. Jiang
Alexandre Sablayrolles
A. Mensch
Chris Bamford
Devendra Singh Chaplot
...
Teven Le Scao
Thibaut Lavril
Thomas Wang
Timothée Lacroix
William El Sayed
MoE
LRM
59
2,170
0
10 Oct 2023
Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts
Lizhang Chen
Bo Liu
Kaizhao Liang
Qian Liu
ODL
48
19
0
09 Oct 2023
Linear attention is (maybe) all you need (to understand transformer optimization)
Kwangjun Ahn
Xiang Cheng
Minhak Song
Chulhee Yun
Ali Jadbabaie
S. Sra
67
51
1
02 Oct 2023
Toward Understanding Why Adam Converges Faster Than SGD for Transformers
Yan Pan
Yuanzhi Li
112
43
0
31 May 2023
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner
Jacques Chen
J. Lavington
Mark Schmidt
75
71
0
27 Apr 2023
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Shuangfei Zhai
Tatiana Likhomanenko
Etai Littwin
Dan Busbridge
Jason Ramapuram
Yizhe Zhang
Jiatao Gu
J. Susskind
AAML
73
74
0
11 Mar 2023
Symbolic Discovery of Optimization Algorithms
Xiangning Chen
Chen Liang
Da Huang
Esteban Real
Kaiyuan Wang
...
Xuanyi Dong
Thang Luong
Cho-Jui Hsieh
Yifeng Lu
Quoc V. Le
139
373
0
13 Feb 2023
How Does Adaptive Optimization Impact Local Neural Network Geometry?
Kaiqi Jiang
Dhruv Malik
Yuanzhi Li
91
18
0
04 Nov 2022
Scratching Visual Transformer's Back with Uniform Attention
Nam Hyeon-Woo
Kim Yu-Ji
Byeongho Heo
Doonyoon Han
Seong Joon Oh
Tae-Hyun Oh
450
23
0
16 Oct 2022
Robustness to Unbounded Smoothness of Generalized SignSGD
M. Crawshaw
Mingrui Liu
Francesco Orabona
Wei Zhang
Zhenxun Zhuang
AAML
71
72
0
23 Aug 2022
Adam Can Converge Without Any Modification On Update Rules
Yushun Zhang
Congliang Chen
Naichen Shi
Ruoyu Sun
Zhimin Luo
36
67
0
20 Aug 2022
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
Lorenzo Noci
Sotiris Anagnostidis
Luca Biggio
Antonio Orvieto
Sidak Pal Singh
Aurelien Lucchi
78
72
0
07 Jun 2022
Revisiting Over-smoothing in BERT from the Perspective of Graph
Han Shi
Jiahui Gao
Hang Xu
Xiaodan Liang
Zhenguo Li
Lingpeng Kong
Stephen M. S. Lee
James T. Kwok
56
74
0
17 Feb 2022
Revisiting Parameter-Efficient Tuning: Are We Really There Yet?
Guanzheng Chen
Fangyu Liu
Zaiqiao Meng
Shangsong Liang
45
93
0
16 Feb 2022
How Do Vision Transformers Work?
Namuk Park
Songkuk Kim
ViT
73
478
0
14 Feb 2022
Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism
Shulun Wang
Bin Liu
Feng Liu
109
16
0
16 Aug 2021
RealFormer: Transformer Likes Residual Attention
Ruining He
Anirudh Ravula
Bhargav Kanagal
Joshua Ainslie
69
109
0
21 Dec 2020
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy
Lucas Beyer
Alexander Kolesnikov
Dirk Weissenborn
Xiaohua Zhai
...
Matthias Minderer
G. Heigold
Sylvain Gelly
Jakob Uszkoreit
N. Houlsby
ViT
530
40,739
0
22 Oct 2020
Big Bird: Transformers for Longer Sequences
Manzil Zaheer
Guru Guruganesh
Kumar Avinava Dubey
Joshua Ainslie
Chris Alberti
...
Philip Pham
Anirudh Ravula
Qifan Wang
Li Yang
Amr Ahmed
VLM
499
2,074
0
28 Jul 2020
Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers
Robin M. Schmidt
Frank Schneider
Philipp Hennig
ODL
69
165
0
03 Jul 2020
Longformer: The Long-Document Transformer
Iz Beltagy
Matthew E. Peters
Arman Cohan
RALM
VLM
128
4,048
0
10 Apr 2020
On Layer Normalization in the Transformer Architecture
Ruibin Xiong
Yunchang Yang
Di He
Kai Zheng
Shuxin Zheng
Chen Xing
Huishuai Zhang
Yanyan Lan
Liwei Wang
Tie-Yan Liu
AI4CE
112
988
0
12 Feb 2020
PyHessian: Neural Networks Through the Lens of the Hessian
Z. Yao
A. Gholami
Kurt Keutzer
Michael W. Mahoney
ODL
48
302
0
16 Dec 2019
On Empirical Comparisons of Optimizers for Deep Learning
Dami Choi
Christopher J. Shallue
Zachary Nado
Jaehoon Lee
Chris J. Maddison
George E. Dahl
66
260
0
11 Oct 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
518
24,351
0
26 Jul 2019
What Does BERT Look At? An Analysis of BERT's Attention
Kevin Clark
Urvashi Khandelwal
Omer Levy
Christopher D. Manning
MILM
209
1,592
0
11 Jun 2019
Learning Deep Transformer Models for Machine Translation
Qiang Wang
Bei Li
Tong Xiao
Jingbo Zhu
Changliang Li
Derek F. Wong
Lidia S. Chao
70
670
0
05 Jun 2019
Why gradient clipping accelerates training: A theoretical justification for adaptivity
J.N. Zhang
Tianxing He
S. Sra
Ali Jadbabaie
72
459
0
28 May 2019
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark
Kenton Lee
Ming-Wei Chang
Tom Kwiatkowski
Michael Collins
Kristina Toutanova
205
1,511
0
24 May 2019
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Alex Jinpeng Wang
Yada Pruksachatkun
Nikita Nangia
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
232
2,307
0
02 May 2019
WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations
Mohammad Taher Pilehvar
Jose Camacho-Collados
161
485
0
28 Aug 2018
Neural Network Acceptability Judgments
Alex Warstadt
Amanpreet Singh
Samuel R. Bowman
209
1,406
0
31 May 2018
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
884
7,141
0
20 Apr 2018
signSGD: Compressed Optimisation for Non-Convex Problems
Jeremy Bernstein
Yu Wang
Kamyar Azizzadenesheli
Anima Anandkumar
FedML
ODL
87
1,041
0
13 Feb 2018
Block-diagonal Hessian-free Optimization for Training Neural Networks
Huishuai Zhang
Caiming Xiong
James Bradbury
R. Socher
ODL
28
22
0
20 Dec 2017
Attention Is All You Need
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
628
130,942
0
12 Jun 2017
Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients
Lukas Balles
Philipp Hennig
66
168
0
22 May 2017
Deep Residual Learning for Image Recognition
Kaiming He
Xinming Zhang
Shaoqing Ren
Jian Sun
MedIm
1.9K
193,426
0
10 Dec 2015
1
2
Next