2304.09871
Cited By
A Theory on Adam Instability in Large-Scale Machine Learning
19 April 2023
Igor Molybog
Peter Albert
Moya Chen
Zach DeVito
David Esiobu
Naman Goyal
Punit Singh Koura
Sharan Narang
Andrew Poulton
Ruan Silva
Binh Tang
Diana Liskovich
Puxin Xu
Yuchen Zhang
Melanie Kambadur
Stephen Roller
Susan Zhang
AI4CE
Papers citing
"A Theory on Adam Instability in Large-Scale Machine Learning"
28 / 28 papers shown
Title
Adaptive Preconditioners Trigger Loss Spikes in Adam
Zhiwei Bai
Zhangchen Zhou
Jiajie Zhao
Xiaolong Li
Zhiyu Li
Feiyu Xiong
Hongkang Yang
Yaoyu Zhang
Z. Xu
ODL
95
0
0
05 Jun 2025
Taming LLMs by Scaling Learning Rates with Gradient Grouping
Siyuan Li
Juanxi Tian
Zedong Wang
Xin Jin
Zicheng Liu
Wentao Zhang
Dan Xu
37
0
0
01 Jun 2025
AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training
Huishuai Zhang
Bohan Wang
Luoxin Chen
ODL
232
0
0
22 May 2025
Short-Range Dependency Effects on Transformer Instability and a Decomposed Attention Solution
Suvadeep Hajra
97
0
0
21 May 2025
Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren
Yong Liu
421
1
0
26 Apr 2025
Numerical Error Analysis of Large Language Models
Stanislav Budzinskiy
Wenyi Fang
Longbin Zeng
Philipp Petersen
92
1
0
13 Mar 2025
Stochastic Rounding for LLM Training: Theory and Practice
Kaan Ozkara
Tao Yu
Youngsuk Park
69
0
0
27 Feb 2025
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
Tianjin Huang
Haotian Hu
Zhenyu Zhang
Gaojie Jin
Xianrui Li
...
Tianlong Chen
Lu Liu
Qingsong Wen
Zhangyang Wang
Shiwei Liu
MQ
111
2
0
24 Feb 2025
Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective
Sizhuang He
Ananyae Kumar Bhartari
Bowen Li
P. Perdikaris
PINN
135
12
0
02 Feb 2025
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Tianjin Huang
Ziquan Zhu
Gaojie Jin
Lu Liu
Zhangyang Wang
Shiwei Liu
119
6
0
12 Jan 2025
Beyond Normal: Learning Spatial Density Models of Node Mobility
Wanxin Gao
Ioanis Nikolaidis
Janelle Harms
33
0
0
17 Nov 2024
Methods of improving LLM training stability
Oleg Rybakov
Mike Chrzanowski
Peter Dykas
Jinze Xue
Ben Lanir
80
1
0
22 Oct 2024
MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts
R. Teo
Tan M. Nguyen
MoE
94
3
0
18 Oct 2024
Scaling Laws For Diffusion Transformers
Zhengyang Liang
Hao He
Ceyuan Yang
Bo Dai
89
14
0
10 Oct 2024
Geometrical structures of digital fluctuations in parameter space of neural networks trained with adaptive momentum optimization
Igor V. Netay
63
1
0
22 Aug 2024
Anytime-Valid Inference for Double/Debiased Machine Learning of Causal Parameters
Abhinandan Dalal
Patrick Blobaum
S. Kasiviswanathan
Aaditya Ramdas
AI4CE
59
2
0
18 Aug 2024
Virchow2: Scaling Self-Supervised Mixed Magnification Models in Pathology
Eric Zimmermann
Eugene Vorontsov
Julian Viret
Adam Casson
Michal Zelechowski
...
Razik Yousfi
Thomas J. Fuchs
Nicolò Fusi
Siqi Liu
Kristen Severson
MedIm
97
39
0
01 Aug 2024
GEB-1.3B: Open Lightweight Large Language Model
Jie Wu
Yufeng Zhu
Lei Shen
Xuqing Lu
ALM
44
0
0
14 Jun 2024
Is Flash Attention Stable?
Alicia Golden
Samuel Hsia
Fei Sun
Bilge Acun
Basil Hosmer
...
Zachary DeVito
Jeff Johnson
Gu-Yeon Wei
David Brooks
Carole-Jean Wu
59
5
0
05 May 2024
Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang
Congliang Chen
Tian Ding
Ziniu Li
Ruoyu Sun
Zhimin Luo
124
57
0
26 Feb 2024
Jointly Training Large Autoregressive Multimodal Models
Emanuele Aiello
L. Yu
Yixin Nie
Armen Aghajanyan
Barlas Oğuz
121
31
0
27 Sep 2023
Small-scale proxies for large-scale Transformer training instabilities
Mitchell Wortsman
Peter J. Liu
Lechao Xiao
Katie Everett
A. Alemi
...
Jascha Narain Sohl-Dickstein
Kelvin Xu
Jaehoon Lee
Justin Gilmer
Simon Kornblith
111
99
0
25 Sep 2023
XGen-7B Technical Report
Erik Nijkamp
Tian Xie
Hiroaki Hayashi
Bo Pang
Congying Xia
...
Chien-Sheng Wu
Silvio Savarese
Yingbo Zhou
Shafiq Joty
Caiming Xiong
ALM
110
13
0
07 Sep 2023
On the Implicit Bias of Adam
M. D. Cattaneo
Jason M. Klusowski
Boris Shigida
82
18
0
31 Aug 2023
Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant
Xianbiao Qi
Jianan Wang
Lei Zhang
50
0
0
15 Jun 2023
ReContrast: Domain-Specific Anomaly Detection via Contrastive Reconstruction
Jia Guo
Shuai Lu
Lize Jia
Weihang Zhang
Huiqi Li
96
31
0
05 Jun 2023
SING: A Plug-and-Play DNN Learning Technique
Adrien Courtois
Damien Scieur
Jean-Michel Morel
Pablo Arias
Thomas Eboli
66
0
0
25 May 2023
Stable and low-precision training for large-scale vision-language models
Mitchell Wortsman
Tim Dettmers
Luke Zettlemoyer
Ari S. Morcos
Ali Farhadi
Ludwig Schmidt
MQ
MLLM
VLM
142
44
0
25 Apr 2023