Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients
Lukas Balles, Philipp Hennig
arXiv:1705.07774 · 22 May 2017 · ArXiv / PDF / HTML
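
For context on the paper these citations point back to: its central observation is that Adam's per-coordinate update m / sqrt(v) can be read as the sign of the momentum-averaged stochastic gradient, scaled by a magnitude that shrinks as the gradient's relative variance grows. The NumPy sketch below illustrates that decomposition on toy values; the setup, constants, and variable names are illustrative assumptions, not taken from the paper's code.

```python
# Minimal sketch (toy values, not the paper's code): Adam's long-run step
# m / (sqrt(v) + eps) behaves per coordinate like sign(gradient) scaled by
# 1 / sqrt(1 + eta^2), where eta^2 = sigma^2 / mu^2 is the relative variance.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.1, -0.5])     # true gradient per coordinate (assumed)
sigma = np.array([0.1, 1.0, 0.5])   # per-coordinate gradient noise std (assumed)

beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = np.zeros(3), np.zeros(3)     # Adam's moving averages of g and g^2
avg_step, n_samples = np.zeros(3), 0

for t in range(1, 20001):
    g = mu + sigma * rng.standard_normal(3)  # stochastic gradient sample
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    if t > 5000:                             # average after burn-in
        avg_step += m / (np.sqrt(v) + eps)
        n_samples += 1

avg_step /= n_samples
# Predicted step: sign(mu) / sqrt(1 + eta^2); shrinks toward 0 as noise grows.
predicted = np.sign(mu) / np.sqrt(1.0 + sigma**2 / mu**2)
print("average Adam step:   ", avg_step)    # roughly [ 0.99,  0.10, -0.71]
print("sign / sqrt(1+eta^2):", predicted)
```

The noisy middle coordinate (sigma much larger than mu) shows the effect: Adam still steps in the sign direction on average, but with a magnitude damped by the relative variance, which is the "variance adaptation" the title refers to.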
Papers citing "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients" (43 papers):
- Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers. Akiyoshi Tomihari, Issei Sato (31 Jan 2025).
- Distributed Sign Momentum with Local Steps for Training Transformers. Shuhua Yu, Ding Zhou, Cong Xie, An Xu, Zhi-Li Zhang, Xin Liu, S. Kar (26 Nov 2024).
- Deconstructing What Makes a Good Optimizer for Language Models. Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade (10 Jul 2024).
- How to set AdamW's weight decay as you scale model and dataset size. Xi Wang, Laurence Aitchison (22 May 2024).
- Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization. Shuo Xie, Zhiyuan Li (05 Apr 2024).
- SignSGD with Federated Voting. Chanho Park, H. Vincent Poor, Namyoon Lee (25 Mar 2024).
- Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models. Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti (29 Feb 2024).
- Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization. Zhen Qin, Zhishuai Liu, Pan Xu (24 Oct 2023).
- Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts. Lizhang Chen, Bo Liu, Kaizhao Liang, Qian Liu (09 Oct 2023).
- Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods. A. Ma, Yangchen Pan, Amir-massoud Farahmand (13 Aug 2023).
- Combining Explicit and Implicit Regularization for Efficient Learning in Deep Networks. Dan Zhao (01 Jun 2023).
- Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training. Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma (23 May 2023).
- Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods. Junchi Yang, Xiang Li, Ilyas Fatkhullin, Niao He (21 May 2023).
- Dropout Reduces Underfitting. Zhuang Liu, Zhi-Qin John Xu, Joseph Jin, Zhiqiang Shen, Trevor Darrell (02 Mar 2023).
- Symbolic Discovery of Optimization Algorithms. Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, ..., Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le (13 Feb 2023).
- A Deep Learning Approach to Generating Photospheric Vector Magnetograms of Solar Active Regions for SOHO/MDI Using SDO/HMI and BBSO Data. Haodi Jiang, Qin Li, Zhihang Hu, Nian Liu, Yasser Abduallah, ..., Genwei Zhang, Yan Xu, Wynne Hsu, Jinqiao Wang, Haimin Wang (04 Nov 2022).
- An Empirical Evaluation of Zeroth-Order Optimization Methods on AI-driven Molecule Optimization. Elvin Lo, Pin-Yu Chen (27 Oct 2022).
- Robustness to Unbounded Smoothness of Generalized SignSGD. M. Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, Zhenxun Zhuang (23 Aug 2022).
- Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks. G. Farhani, Alexander Kazachek, Boyu Wang (29 Jun 2022).
- Logit Normalization for Long-tail Object Detection. Liang Zhao, Yao Teng, Limin Wang (31 Mar 2022).
- A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range. Guoqiang Zhang, Kenta Niwa, W. Kleijn (24 Mar 2022).
- Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He (12 Feb 2022).
- Extending AdamW by Leveraging Its Second Moment and Magnitude. Guoqiang Zhang, Kenta Niwa, W. Kleijn (09 Dec 2021).
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization. Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu (25 Aug 2021).
- Large Scale Private Learning via Low-rank Reparametrization. Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu (17 Jun 2021).
- Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks. Dong-Young Lim, Sotirios Sabanis (28 May 2021).
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training. Chen Zhu, Renkun Ni, Zheng Xu, Kezhi Kong, Wenjie Huang, Tom Goldstein (16 Feb 2021).
- A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms. Chao Ma, Lei Wu, E. Weinan (14 Sep 2020).
- AdaScale SGD: A User-Friendly Algorithm for Distributed Training. Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin (09 Jul 2020).
- Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers. Robin M. Schmidt, Frank Schneider, Philipp Hennig (03 Jul 2020).
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients. Chenfei Zhu, Yu Cheng, Zhe Gan, Furong Huang, Jingjing Liu, Tom Goldstein (21 Jun 2020).
- An Analysis of Constant Step Size SGD in the Non-convex Regime: Asymptotic Normality and Bias. Lu Yu, Krishnakumar Balasubramanian, S. Volgushev, Murat A. Erdogdu (14 Jun 2020).
- LaProp: Separating Momentum and Adaptivity in Adam. Liu Ziyin, Zhikang T. Wang, Masahito Ueda (12 Feb 2020).
- ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization. Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, David Cox (15 Oct 2019).
- On Empirical Comparisons of Optimizers for Deep Learning. Dami Choi, Christopher J. Shallue, Zachary Nado, Jaehoon Lee, Chris J. Maddison, George E. Dahl (11 Oct 2019).
- Limitations of the Empirical Fisher Approximation for Natural Gradient Descent. Frederik Kunstner, Lukas Balles, Philipp Hennig (29 May 2019).
- Error Feedback Fixes SignSGD and other Gradient Compression Schemes. Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, Martin Jaggi (28 Jan 2019).
- Escaping Saddle Points with Adaptive Gradient Methods. Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, S. Sra (26 Jan 2019).
- A Sufficient Condition for Convergences of Adam and RMSProp. Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu (23 Nov 2018).
- signSGD with Majority Vote is Communication Efficient And Fault Tolerant. Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar (11 Oct 2018).
- signSGD: Compressed Optimisation for Non-Convex Problems. Jeremy Bernstein, Yu Wang, Kamyar Azizzadenesheli, Anima Anandkumar (13 Feb 2018).
- On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang (15 Sep 2016).
- Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition. Hamed Karimi, J. Nutini, Mark Schmidt (16 Aug 2016).