ResearchTrend.AI

arXiv:2402.11215 — Cited By
AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

17 February 2024
Tim Tsz-Kit Lau, Han Liu, Mladen Kolar
ODL

Papers citing "AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods"

10 / 10 papers shown
Adaptive Batch Size Schedules for Distributed Training of Language Models with Data and Model Parallelism
Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
30 Dec 2024
Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods
Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar
20 Jun 2024
Enhancing Efficiency of Safe Reinforcement Learning via Sample Manipulation
Shangding Gu, Laixi Shi, Yuhao Ding, Alois Knoll, C. Spanos, Adam Wierman, Ming Jin
OffRL
31 May 2024
Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
Frederik Kunstner, Jacques Chen, J. Lavington, Mark W. Schmidt
27 Apr 2023
Adaptive Sampling Quasi-Newton Methods for Zeroth-Order Stochastic Optimization
Raghu Bollapragada, Stefan M. Wild
24 Sep 2021
A High Probability Analysis of Adaptive SGD with Momentum
Xiaoyun Li, Francesco Orabona
28 Jul 2020
A Simple Convergence Proof of Adam and Adagrad
Alexandre Défossez, Léon Bottou, Francis R. Bach, Nicolas Usunier
05 Mar 2020
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
MoE
17 Sep 2019
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
N. Keskar, Dheevatsa Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang
ODL
15 Sep 2016