The Marginal Value of Momentum for Small Learning Rate SGD [ODL]
Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li
arXiv: 2307.15196, 27 July 2023
Papers citing "The Marginal Value of Momentum for Small Learning Rate SGD" (14 of 14 papers shown)
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
Xianliang Li, Jun Luo, Zhiwei Zheng, Hanxiao Wang, Li Luo, Lingkun Wen, Linlong Wu, Sheng Xu
29 Nov 2024

Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training [AI4CE]
Atli Kosson, Bettina Messmer, Martin Jaggi
31 Oct 2024

Does SGD really happen in tiny subspaces?
Minhak Song, Kwangjun Ahn, Chulhee Yun
25 May 2024

(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum
Anh Dang, Reza Babanezhad, Sharan Vaswani
12 Jan 2024

Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
Rui Pan, Yuxing Liu, Xiaoyu Wang, Tong Zhang
22 Dec 2023

A Quadratic Synchronization Rule for Distributed Deep Learning
Xinran Gu, Kaifeng Lyu, Sanjeev Arora, Jingzhao Zhang, Longbo Huang
22 Oct 2023

Flatter, faster: scaling momentum for optimal speedup of SGD
Aditya Cowsik, T. Can, Paolo Glorioso
28 Oct 2022

A Kernel-Based View of Language Model Fine-Tuning [VLM]
Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, Sanjeev Arora
11 Oct 2022

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Sadhika Malladi, Kaifeng Lyu, A. Panigrahi, Sanjeev Arora
20 May 2022

Understanding Gradient Descent on Edge of Stability in Deep Learning [MLT]
Sanjeev Arora, Zhiyuan Li, A. Panigrahi
19 May 2022

What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [MLT]
Zhiyuan Li, Tianhao Wang, Sanjeev Arora
13 Oct 2021

Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao, Adam Fisch, Danqi Chen
31 Dec 2020

Quasi-hyperbolic momentum and Adam for deep learning [ODL]
Jerry Ma, Denis Yarats
16 Oct 2018

A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay
L. Smith
26 Mar 2018