SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training
arXiv:2505.23489 · 29 May 2025
Ildus Sadrtdinov, Ivan Klimov, E. Lobacheva, Dmitry Vetrov
Papers citing "SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training" (18 of 18 papers shown):
1. Symbolic Discovery of Optimization Algorithms — 13 Feb 2023 — 367 citations
   Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, ..., Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le

2. Stochastic Training is Not Necessary for Generalization — 29 Sep 2021 — 74 citations
   Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein

3. Learning Transferable Visual Models From Natural Language Supervision — 26 Feb 2021 — 28,659 citations — [CLIP, VLM]
   Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya A. Ramesh, Gabriel Goh, ..., Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

4. Implicit Gradient Regularization — 23 Sep 2020 — 149 citations
   David Barrett, Benoit Dherin

5. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations — 20 Jun 2020 — 5,677 citations — [SSL]
   Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli

6. Language Models are Few-Shot Learners — 28 May 2020 — 41,106 citations — [BDL]
   Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

7. Scaling Laws for Neural Language Models — 23 Jan 2020 — 4,662 citations
   Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei

8. Big Transfer (BiT): General Visual Representation Learning — 24 Dec 2019 — 1,196 citations — [MQ]
   Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, J. Puigcerver, Jessica Yung, Sylvain Gelly, N. Houlsby

9. Reconciling modern machine learning practice and the bias-variance trade-off — 28 Dec 2018 — 1,628 citations
   M. Belkin, Daniel J. Hsu, Siyuan Ma, Soumik Mandal

10. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks — 21 Nov 2018 — 448 citations — [ODL]
    Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu

11. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — 11 Oct 2018 — 93,936 citations — [VLM, SSL, SSeg]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

12. Gradient Descent Provably Optimizes Over-parameterized Neural Networks — 04 Oct 2018 — 1,261 citations — [MLT, ODL]
    S. Du, Xiyu Zhai, Barnabás Póczós, Aarti Singh

13. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning — 18 Dec 2017 — 289 citations
    Siyuan Ma, Raef Bassily, M. Belkin

14. Three Factors Influencing Minima in SGD — 13 Nov 2017 — 458 citations
    Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey

15. Stochastic Gradient Descent as Approximate Bayesian Inference — 13 Apr 2017 — 594 citations — [BDL]
    Stephan Mandt, Matthew D. Hoffman, David M. Blei

16. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys — 06 Nov 2016 — 769 citations — [ODL]
    Pratik Chaudhari, A. Choromańska, Stefano Soatto, Yann LeCun, Carlo Baldassi, C. Borgs, J. Chayes, Levent Sagun, R. Zecchina

17. Deep Learning and the Information Bottleneck Principle — 09 Mar 2015 — 1,570 citations — [DRL]
    Naftali Tishby, Noga Zaslavsky

18. Practical recommendations for gradient-based training of deep architectures — 24 Jun 2012 — 2,195 citations — [3DH, ODL]
    Yoshua Bengio