ResearchTrend.AI
SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training
Ildus Sadrtdinov, Ivan Klimov, E. Lobacheva, Dmitry Vetrov
arXiv:2505.23489 · 29 May 2025

Papers citing "SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training"

Showing 50 of 55 citing papers.
• Abrupt Learning in Transformers: A Case Study on Matrix Completion. Pulkit Gopalani, Ekdeep Singh Lubana, Wei Hu. 29 Oct 2024.
• Where Do Large Learning Rates Lead Us? Ildus Sadrtdinov, M. Kodryan, Eduard Pokonechny, E. Lobacheva, Dmitry Vetrov. 29 Oct 2024. [AI4CE]
• Towards Few-Shot Adaptation of Foundation Models via Multitask Finetuning. Zhuoyan Xu, Zhenmei Shi, Junyi Wei, Fangzhou Mu, Yin Li, Yingyu Liang. 22 Feb 2024.
• Few-shot Adaptation of Multi-modal Foundation Models: A Survey. Fan Liu, Tianshu Zhang, Wenwen Dai, Wenwen Cai, Xiaocong Zhou, Delong Chen. 03 Jan 2024. [VLM, OffRL]
• Why Do We Need Weight Decay in Modern Deep Learning? Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, Nicolas Flammarion. 06 Oct 2023.
• Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR). Guo-qing Jiang, Jinlong Liu, Zixiang Ding, Lin Guo, W. Lin. 24 Sep 2023. [AI4CE]
• On the different regimes of Stochastic Gradient Descent. Antonio Sclocchi, Matthieu Wyart. 19 Sep 2023.
• Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs. Angelica Chen, Ravid Schwartz-Ziv, Kyunghyun Cho, Matthew L. Leavitt, Naomi Saphra. 13 Sep 2023.
• Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks. F. Chen, D. Kunin, Atsushi Yamamura, Surya Ganguli. 07 Jun 2023.
• Phase transitions in the mini-batch size for sparse and dense two-layer neural networks. Raffaele Marino, F. Ricci-Tersenghi. 10 May 2023.
• Bayesian Free Energy of Deep ReLU Neural Network in Overparametrized Cases. Shuya Nagayasu, Sumio Watanabe. 28 Mar 2023. [BDL]
• Symbolic Discovery of Optimization Algorithms. Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, ..., Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le. 13 Feb 2023.
• SGD with Large Step Sizes Learns Sparse Features. Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion. 11 Oct 2022.
• Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. M. Kodryan, E. Lobacheva, M. Nakhodnov, Dmitry Vetrov. 08 Sep 2022.
• Emergent Abilities of Large Language Models. Jason W. Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, ..., Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, J. Dean, W. Fedus. 15 Jun 2022. [ELM, ReLM, LRM]
• Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, ..., Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, Ziyi Wu. 09 Jun 2022. [ELM]
• Towards Understanding Grokking: An Effective Theory of Representation Learning. Ziming Liu, O. Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams. 20 May 2022. [AI4CE]
• Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, Vedant Misra. 06 Jan 2022.
• Stochastic Training is Not Necessary for Generalization. Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein. 29 Sep 2021.
• The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion. D. Kunin, Javier Sagastuy-Breña, Lauren Gillespie, Eshed Margalit, Hidenori Tanaka, Surya Ganguli, Daniel L. K. Yamins. 19 Jul 2021.
• On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay. E. Lobacheva, M. Kodryan, Nadezhda Chirkova, A. Malinin, Dmitry Vetrov. 29 Jun 2021.
• Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar. 26 Feb 2021. [ODL]
• Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya A. Ramesh, Gabriel Goh, ..., Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 26 Feb 2021. [CLIP, VLM]
• Strength of Minibatch Noise in SGD. Liu Ziyin, Kangqiao Liu, Takashi Mori, Masakuni Ueda. 10 Feb 2021. [ODL, MLT]
• On the Origin of Implicit Regularization in Stochastic Gradient Descent. Samuel L. Smith, Benoit Dherin, David Barrett, Soham De. 28 Jan 2021. [MLT]
• Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent. Kangqiao Liu, Liu Ziyin, Masakuni Ueda. 07 Dec 2020. [MLT]
• Implicit Gradient Regularization. David Barrett, Benoit Dherin. 23 Sep 2020.
• On the Generalization Benefit of Noise in Stochastic Gradient Descent. Samuel L. Smith, Erich Elsen, Soham De. 26 Jun 2020. [MLT]
• wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli. 20 Jun 2020. [SSL]
• Language Models are Few-Shot Learners. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. 28 May 2020. [BDL]
• A Free-Energy Principle for Representation Learning. Yansong Gao, Pratik Chaudhari. 27 Feb 2020. [DRL]
• Scaling Laws for Neural Language Models. Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei. 23 Jan 2020.
• Understanding Why Neural Networks Generalize Well Through GSNR of Parameters. Jinlong Liu, Guo-qing Jiang, Yunzhi Bai, Ting Chen, Huayan Wang. 21 Jan 2020. [AI4CE]
• Big Transfer (BiT): General Visual Representation Learning. Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, J. Puigcerver, Jessica Yung, Sylvain Gelly, N. Houlsby. 24 Dec 2019. [MQ]
• Deep Double Descent: Where Bigger Models and More Data Hurt. Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever. 04 Dec 2019.
• Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. Yuanzhi Li, Colin Wei, Tengyu Ma. 10 Jul 2019.
• Reconciling modern machine learning practice and the bias-variance trade-off. M. Belkin, Daniel J. Hsu, Siyuan Ma, Soumik Mandal. 28 Dec 2018.
• Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks. Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu. 21 Nov 2018. [ODL]
• BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 11 Oct 2018. [VLM, SSL, SSeg]
• Gradient Descent Provably Optimizes Over-parameterized Neural Networks. S. Du, Xiyu Zhai, Barnabás Póczós, Aarti Singh. 04 Oct 2018. [MLT, ODL]
• Fluctuation-dissipation relations for stochastic gradient descent. Sho Yaida. 28 Sep 2018.
• TherML: Thermodynamics of Machine Learning. Alexander A. Alemi, Ian S. Fischer. 11 Jul 2018. [DRL, AI4CE]
• Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate. Mor Shpigel Nacson, Nathan Srebro, Daniel Soudry. 05 Jun 2018. [FedML, MLT]
• Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning. Yao Zhang, Andrew M. Saxe, Madhu S. Advani, A. Lee. 05 Mar 2018.
• Essentially No Barriers in Neural Network Energy Landscape. Felix Dräxler, K. Veschgini, M. Salmhofer, Fred Hamprecht. 02 Mar 2018. [MoMe]
• Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. T. Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, A. Wilson. 27 Feb 2018. [UQCV]
• The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning. Siyuan Ma, Raef Bassily, M. Belkin. 18 Dec 2017.
• Three Factors Influencing Minima in SGD. Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey. 13 Nov 2017.
• Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. Pratik Chaudhari, Stefano Soatto. 30 Oct 2017. [MLT]
• A Bayesian Perspective on Generalization and Stochastic Gradient Descent. Samuel L. Smith, Quoc V. Le. 17 Oct 2017. [BDL]