SGD with Large Step Sizes Learns Sparse Features
Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion
arXiv:2210.05337 · 11 October 2022
Links: arXiv abstract · PDF · HTML · GitHub (32★)
Papers citing "SGD with Large Step Sizes Learns Sparse Features" (50 of 52 citing papers shown)
Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes
Ruiqi Zhang, Jingfeng Wu, Licong Lin, Peter L. Bartlett · 05 Apr 2025 · 2 citations

Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries
Chris Kolb, T. Weber, Bernd Bischl, David Rügamer · 04 Feb 2025 · 1 citation

Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks
Amit Peleg, Matthias Hein · 04 Jul 2024 · 0 citations

Fine-tuning with Very Large Dropout
Jianyu Zhang, Léon Bottou · 01 Mar 2024 · 2 citations

Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang · 18 Jul 2022 · 133 citations

On the Maximum Hessian Eigenvalue and Generalization
Simran Kaur, Jeremy M. Cohen, Zachary Chase Lipton · 21 Jun 2022 · 43 citations

Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation
Loucas Pillaud-Vivien, J. Reygner, Nicolas Flammarion · 20 Jun 2022 · 34 citations

Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions
Ning Yang, Chao Tang, Yuhai Tu · 02 Jun 2022 · 22 citations

Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
Etienne Boursier, Loucas Pillaud-Vivien, Nicolas Flammarion · 02 Jun 2022 · 61 citations

On the Benefits of Large Learning Rates for Kernel Methods
Gaspard Beugnot, Julien Mairal, Alessandro Rudi · 28 Feb 2022 · 11 citations

What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
Zhiyuan Li, Tianhao Wang, Sanjeev Arora · 13 Oct 2021 · 105 citations

Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect
Yuqing Wang, Minshuo Chen, T. Zhao, Molei Tao · 07 Oct 2021 · 42 citations

Stochastic Training is Not Necessary for Generalization
Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein · 29 Sep 2021 · 76 citations

Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity
Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, Franck Gabriel · 30 Jun 2021 · 55 citations

Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity
Scott Pesme, Loucas Pillaud-Vivien, Nicolas Flammarion · 17 Jun 2021 · 108 citations

Label Noise SGD Provably Prefers Flat Global Minimizers
Alexandru Damian, Tengyu Ma, Jason D. Lee · 11 Jun 2021 · 120 citations

Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis
Stephan Wojtowytsch · 04 Jun 2021 · 34 citations

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis
Stephan Wojtowytsch · 04 May 2021 · 51 citations

Acceleration via Fractal Learning Rate Schedules
Naman Agarwal, Surbhi Goel, Cyril Zhang · 01 Mar 2021 · 18 citations

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar · 26 Feb 2021 · 277 citations

Strength of Minibatch Noise in SGD
Liu Ziyin, Kangqiao Liu, Takashi Mori, Masahito Ueda · 10 Feb 2021 · 35 citations

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste · 31 Jan 2021 · 725 citations

On the Origin of Implicit Regularization in Stochastic Gradient Descent
Samuel L. Smith, Benoit Dherin, David Barrett, Soham De · 28 Jan 2021 · 204 citations

Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate
Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu · 04 Nov 2020 · 18 citations

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate
Zhiyuan Li, Kaifeng Lyu, Sanjeev Arora · 06 Oct 2020 · 75 citations

Sharpness-Aware Minimization for Efficiently Improving Generalization
Pierre Foret, Ariel Kleiner, H. Mobahi, Behnam Neyshabur · 03 Oct 2020 · 1,358 citations

Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy
E. Moroshko, Suriya Gunasekar, Blake E. Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry · 13 Jul 2020 · 86 citations

Shape Matters: Understanding the Implicit Bias of the Noise Covariance
Jeff Z. HaoChen, Colin Wei, Jason D. Lee, Tengyu Ma · 15 Jun 2020 · 95 citations

Directional convergence and alignment in deep learning
Ziwei Ji, Matus Telgarsky · 11 Jun 2020 · 171 citations

Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems
Preetum Nakkiran · 15 May 2020 · 21 citations

The large learning rate phase of deep learning: the catapult mechanism
Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Narain Sohl-Dickstein, Guy Gur-Ari · 04 Mar 2020 · 241 citations

Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss
Lénaïc Chizat, Francis R. Bach · 11 Feb 2020 · 341 citations

Deep Double Descent: Where Bigger Models and More Data Hurt
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever · 04 Dec 2019 · 945 citations

An Exponential Learning Rate Schedule for Deep Learning
Zhiyuan Li, Sanjeev Arora · 16 Oct 2019 · 219 citations

Implicit Regularization for Optimal Sparse Recovery
Tomas Vaskevicius, Varun Kanade, Patrick Rebeschini · 11 Sep 2019 · 103 citations

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks
Yuanzhi Li, Colin Wei, Tengyu Ma · 10 Jul 2019 · 299 citations

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
Kaifeng Lyu, Jian Li · 13 Jun 2019 · 336 citations

Kernel and Rich Regimes in Overparametrized Models
Blake E. Woodworth, Suriya Gunasekar, Pedro H. P. Savarese, E. Moroshko, Itay Golan, Jason D. Lee, Daniel Soudry, Nathan Srebro · 13 Jun 2019 · 366 citations

Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process
Guy Blanc, Neha Gupta, Gregory Valiant, Paul Valiant · 19 Apr 2019 · 147 citations

On Lazy Training in Differentiable Programming
Lénaïc Chizat, Edouard Oyallon, Francis R. Bach · 19 Dec 2018 · 839 citations

Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations
Qianxiao Li, Cheng Tai, Weinan E · 05 Nov 2018 · 150 citations

Three Mechanisms of Weight Decay Regularization
Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger C. Grosse · 29 Oct 2018 · 259 citations

Understanding Batch Normalization
Johan Bjorck, Carla P. Gomes, B. Selman, Kilian Q. Weinberger · 01 Jun 2018 · 612 citations

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Jonathan Frankle, Michael Carbin · 09 Mar 2018 · 3,485 citations

A Walk with SGD
Chen Xing, Devansh Arpit, Christos Tsirigotis, Yoshua Bengio · 24 Feb 2018 · 119 citations

The Implicit Bias of Gradient Descent on Separable Data
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro · 27 Oct 2017 · 924 citations

Sharp Minima Can Generalize For Deep Nets
Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio · 15 Mar 2017 · 774 citations

Understanding deep learning requires rethinking generalization
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals · 10 Nov 2016 · 4,635 citations

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang · 15 Sep 2016 · 2,945 citations

Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun · 10 Dec 2015 · 194,426 citations