Don't Decay the Learning Rate, Increase the Batch Size

1 November 2017

Samuel L. Smith

Pieter-Jan Kindermans

Papers citing "Don't Decay the Learning Rate, Increase the Batch Size"

50 / 454 papers shown

HASFL: Heterogeneity-aware Split Federated Learning over Edge Computing Systems Zheng Lin Zhe Chen Xianhao Chen Wei Ni Yue Gao FedML 34 0 0 10 Jun 2025
A Stable Whitening Optimizer for Efficient Neural Network Training Kevin Frans Sergey Levine Pieter Abbeel 39 0 0 08 Jun 2025
Variational Adaptive Noise and Dropout towards Stable Recurrent Neural Networks Taisuke Kobayashi Shingo Murata 56 0 0 02 Jun 2025
Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training William Merrill Shane Arora Dirk Groeneveld Hannaneh Hajishirzi 55 0 0 29 May 2025
DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models Alex Iacob Lorenzo Sani M. Safaryan Paris Giampouras Samuel Horváth ... Meghdad Kurmanji Preslav Aleksandrov William F. Shen Xinchi Qiu Nicholas D. Lane OffRL 112 0 0 28 May 2025
Variational Deep Learning via Implicit Regularization Jonathan Wenger Beau Coker Juraj Marusic John P. Cunningham OOD UQCV BDL 64 0 0 26 May 2025
A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices Chen Gong Rui Xing Zhenzhe Zheng Fan Wu 68 0 0 22 May 2025
Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training Shane Bergsma Nolan Dey Gurpreet Gosal Gavia Gray Daria Soboleva Joel Hestness 80 2 0 19 May 2025
Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients Yezhen Wang Zhouhao Yang Brian K Chen Fanyi Pu Yue Liu Tianyu Gao Kenji Kawaguchi 93 0 0 03 May 2025
A Langevin sampling algorithm inspired by the Adam optimizer Benedict Leimkuhler René Lohmann Peter Whalley 173 0 0 26 Apr 2025
Representation Improvement in Latent Space for Search-Based Testing of Autonomous Robotic Systems D. Humeniuk Foutse Khomh 112 0 0 26 Mar 2025
OmniLearn: A Framework for Distributed Deep Learning over Heterogeneous Clusters S. Tyagi Prateek Sharma 141 0 0 21 Mar 2025
Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training Paul Janson Vaibhav Singh Paria Mehrbod Adam Ibrahim Irina Rish Eugene Belilovsky Benjamin Thérien CLL 135 1 0 04 Mar 2025
Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs Shane Bergsma Nolan Dey Gurpreet Gosal Gavia Gray Daria Soboleva Joel Hestness 109 8 0 21 Feb 2025
Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent Hikaru Umeda Hideaki Iiduka 160 2 0 17 Feb 2025
Linear Mode Connectivity in Differentiable Tree Ensembles Ryuichi Kanoh M. Sugiyama 241 1 0 17 Feb 2025
On the use of neural networks for the structural characterization of polymeric porous materials Jorge Torre Suset Barroso-Solares M.A. Rodríguez-Pérez Javier Pinto 116 6 0 25 Jan 2025
Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum Keisuke Kamo Hideaki Iiduka 129 0 0 15 Jan 2025
Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit Oleg Filatov Jan Ebert Jiangtao Wang Stefan Kesselheim 118 4 0 10 Jan 2025
Towards Precise Scaling Laws for Video Diffusion Transformers Yuanyang Yin Yaqi Zhao Mingwu Zheng Ke Lin Jiarong Ou ... Pengfei Wan Di Zhang Baoqun Yin Wentao Zhang Kun Gai 205 3 0 03 Jan 2025
A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation Xiaoqian Liu Yangfan Du Jiadong Wang Yuan Ge Chen Xu Tong Xiao Guocheng Chen Jingbo Zhu 147 0 0 31 Dec 2024
Weber-Fechner Law in Temporal Difference learning derived from Control as Inference Keiichiro Takahashi Taisuke Kobayashi Tomoya Yamanokuchi Takamitsu Matsubara 76 0 0 31 Dec 2024
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism Tim Tsz-Kit Lau Weijian Li Chenwei Xu Han Liu Mladen Kolar 473 0 0 30 Dec 2024
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs Aldo Pareja Nikhil Shivakumar Nayak Hao Wang Krishnateja Killamsetty Shivchander Sudalairaj ... Guangxuan Xu Kai Xu Ligong Han Luke Inglis Akash Srivastava 200 7 0 17 Dec 2024
Impact of Privacy Parameters on Deep Learning Models for Image Classification Basanta Chaulagain 96 0 0 09 Dec 2024
Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods Jiamian Hu Yuanyuan Hong Yihua Chen He Wang Moriaki Yasuhara 131 1 0 03 Dec 2024
How Does Critical Batch Size Scale in Pre-training? Hanlin Zhang Depen Morwani Nikhil Vyas Jingfeng Wu Difan Zou Udaya Ghai Dean Phillips Foster Sham Kakade 203 18 0 29 Oct 2024
Convergence of Sharpness-Aware Minimization Algorithms using Increasing Batch Size and Decaying Learning Rate Hinata Harada Hideaki Iiduka 65 1 0 16 Sep 2024
Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions Sully F. Chen Robert J. Steele Glen M. Hocky Beakal Lemeneh S. Lad Eric Oermann AI4CE 93 0 0 29 Aug 2024
Can Optimization Trajectories Explain Multi-Task Transfer? David Mueller Mark Dredze Nicholas Andrews 142 1 0 26 Aug 2024
Scaling Law with Learning Rate Annealing Howe Tissue Venus Wang Lu Wang 108 9 0 20 Aug 2024
Stochastic weight matrix dynamics during learning and Dyson Brownian motion Gert Aarts B. Lucini Chanju Park 86 1 0 23 Jul 2024
Localizing Anomalies via Multiscale Score Matching Analysis Ahsan Mahmood Junier Oliva M. Styner 51 1 0 28 Jun 2024
Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods Tim Tsz-Kit Lau Weijian Li Chenwei Xu Han Liu Mladen Kolar 92 1 0 20 Jun 2024
Meta-Learning Neural Procedural Biases Christian Raymond Qi Chen Bing Xue Mengjie Zhang 107 1 0 12 Jun 2024
Primitive Agentic First-Order Optimization R. Sala 73 0 0 07 Jun 2024
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations Alexander Hägele Elie Bakouch Atli Kosson Loubna Ben Allal Leandro von Werra Martin Jaggi 127 45 0 28 May 2024
FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information Dongseong Hwang ODL 101 9 0 21 May 2024
High dimensional analysis reveals conservative sharpening and a stochastic edge of stability Atish Agarwala Jeffrey Pennington 112 4 0 30 Apr 2024
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies Shengding Hu Yuge Tu Xu Han Chaoqun He Ganqu Cui ... Chaochao Jia Guoyang Zeng Dahai Li Zhiyuan Liu Maosong Sun MoE 131 347 0 09 Apr 2024
Rolling the dice for better deep learning performance: A study of randomness techniques in deep neural networks Mohammed Ghaith Altarabichi Sławomir Nowaczyk Sepideh Pashami Peyman Sheikholharam Mashhadi Julia Handl 42 11 0 05 Apr 2024
AdaptSFL: Adaptive Split Federated Learning in Resource-constrained Edge Networks Zhengyi Lin Guanqiao Qu Wei Wei Xianhao Chen Kin K. Leung 130 51 0 19 Mar 2024
Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons Simon Dufort-Labbé P. D'Oro Evgenii Nikishin Razvan Pascanu Pierre-Luc Bacon A. Baratin 111 1 0 12 Mar 2024
A Tutorial on the Pretrain-Finetune Paradigm for Natural Language Processing Yu Wang Wen Qu 92 0 0 04 Mar 2024
Batch size invariant Adam Xi Wang Laurence Aitchison 89 2 0 29 Feb 2024
Principled Architecture-aware Scaling of Hyperparameters Wuyang Chen Junru Wu Zhangyang Wang Boris Hanin AI4CE 104 0 0 27 Feb 2024
Iteration and Stochastic First-order Oracle Complexities of Stochastic Gradient Descent using Constant and Decaying Learning Rates Kento Imaizumi Hideaki Iiduka 63 2 0 23 Feb 2024
Scaling physics-informed hard constraints with mixture-of-experts N. Chalapathi Yiheng Du Aditi Krishnapriyan AI4CE 102 16 0 20 Feb 2024
AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods Tim Tsz-Kit Lau Han Liu Mladen Kolar ODL 85 6 0 17 Feb 2024
A Framework For Gait-Based User Demography Estimation Using Inertial Sensors C. Swami 40 1 0 15 Feb 2024