SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training
arXiv:2505.23489 · 29 May 2025
Ildus Sadrtdinov, Ivan Klimov, E. Lobacheva, Dmitry Vetrov
Papers citing "SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training" (18 of 18 papers shown):
1. Symbolic Discovery of Optimization Algorithms — 13 Feb 2023 — 367 citations
   Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, ..., Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le

2. Stochastic Training is Not Necessary for Generalization — 29 Sep 2021 — 74 citations
   Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein

3. Learning Transferable Visual Models From Natural Language Supervision — 26 Feb 2021 — 28,659 citations — [CLIP, VLM]
   Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya A. Ramesh, Gabriel Goh, ..., Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

4. Implicit Gradient Regularization — 23 Sep 2020 — 149 citations
   David Barrett, Benoit Dherin

5. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations — 20 Jun 2020 — 5,677 citations — [SSL]
   Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli

6. Language Models are Few-Shot Learners — 28 May 2020 — 41,106 citations — [BDL]
   Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

7. Scaling Laws for Neural Language Models — 23 Jan 2020 — 4,662 citations
   Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei

8. Big Transfer (BiT): General Visual Representation Learning — 24 Dec 2019 — 1,196 citations — [MQ]
   Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, J. Puigcerver, Jessica Yung, Sylvain Gelly, N. Houlsby

9. Reconciling modern machine learning practice and the bias-variance trade-off — 28 Dec 2018 — 1,628 citations
   M. Belkin, Daniel J. Hsu, Siyuan Ma, Soumik Mandal

10. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks — 21 Nov 2018 — 448 citations — [ODL]
    Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu

11. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — 11 Oct 2018 — 93,936 citations — [VLM, SSL, SSeg]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

12. Gradient Descent Provably Optimizes Over-parameterized Neural Networks — 04 Oct 2018 — 1,261 citations — [MLT, ODL]
    S. Du, Xiyu Zhai, Barnabás Póczós, Aarti Singh

13. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning — 18 Dec 2017 — 289 citations
    Siyuan Ma, Raef Bassily, M. Belkin

14. Three Factors Influencing Minima in SGD — 13 Nov 2017 — 458 citations
    Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, Amos Storkey

15. Stochastic Gradient Descent as Approximate Bayesian Inference — 13 Apr 2017 — 594 citations — [BDL]
    Stephan Mandt, Matthew D. Hoffman, David M. Blei

16. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys — 06 Nov 2016 — 769 citations — [ODL]
    Pratik Chaudhari, A. Choromańska, Stefano Soatto, Yann LeCun, Carlo Baldassi, C. Borgs, J. Chayes, Levent Sagun, R. Zecchina

17. Deep Learning and the Information Bottleneck Principle — 09 Mar 2015 — 1,570 citations — [DRL]
    Naftali Tishby, Noga Zaslavsky

18. Practical recommendations for gradient-based training of deep architectures — 24 Jun 2012 — 2,195 citations — [3DH, ODL]
    Yoshua Bengio