Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
27 April 2023 · arXiv:2304.13960
Frederik Kunstner, Jacques Chen, J. Lavington, Mark Schmidt

Papers citing "Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be"

All 19 citing papers are shown below.

Distributed Sign Momentum with Local Steps for Training Transformers
Shuhua Yu, Ding Zhou, Cong Xie, An Xu, Zhi-Li Zhang, Xin Liu, S. Kar
26 Nov 2024 · 0 citations

Deconstructing What Makes a Good Optimizer for Language Models
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, Sham Kakade
10 Jul 2024 · 19 citations

Does SGD really happen in tiny subspaces?
Minhak Song, Kwangjun Ahn, Chulhee Yun
25 May 2024 · 6 citations

Implicit Bias and Fast Convergence Rates for Self-attention
Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis
08 Feb 2024 · 18 citations

On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions
Yusu Hong, Junhong Lin
06 Feb 2024 · 13 citations

Symbolic Discovery of Optimization Algorithms
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, ..., Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le
13 Feb 2023 · 367 citations

Stochastic Training is Not Necessary for Generalization
Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein
29 Sep 2021 · 75 citations

A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes
Zachary Nado, Justin M. Gilmer, Christopher J. Shallue, Rohan Anil, George E. Dahl
12 Feb 2021 · 27 citations · ODL

A Simple Convergence Proof of Adam and Adagrad
Alexandre Défossez, Léon Bottou, Francis R. Bach, Nicolas Usunier
05 Mar 2020 · 150 citations

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf
02 Oct 2019 · 7,437 citations

Introduction to Online Convex Optimization
Elad Hazan
07 Sep 2019 · 1,922 citations · OffRL

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, J. Demmel, Kurt Keutzer, Cho-Jui Hsieh
01 Apr 2019 · 991 citations · ODL

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai, Zhilin Yang, Yiming Yang, J. Carbonell, Quoc V. Le, Ruslan Salakhutdinov
09 Jan 2019 · 3,714 citations · VLM

Measuring the Effects of Data Parallelism on Neural Network Training
Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl
08 Nov 2018 · 408 citations

Convergence of Gradient Descent on Separable Data
Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry
05 Mar 2018 · 167 citations

Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients
Lukas Balles, Philipp Hennig
22 May 2017 · 166 citations

Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, R. Socher
26 Sep 2016 · 2,814 citations · RALM

Layer Normalization
Jimmy Lei Ba, J. Kiros, Geoffrey E. Hinton
21 Jul 2016 · 10,412 citations

SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang
16 Jun 2016 · 8,067 citations · RALM