Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods
arXiv:2406.13936

20 June 2024
Tim Tsz-Kit Lau, Weijian Li, Chenwei Xu, Han Liu, Mladen Kolar

Papers citing "Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods"

19 papers shown

 1. Nemotron-4 15B Technical Report (26 Feb 2024)
    Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, M. Patwary, Sandeep Subramanian, ..., Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, Bryan Catanzaro
 2. Asynchronous Local-SGD Training for Language Modeling (17 Jan 2024) [FedML]
    Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato
 3. Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays (15 Jun 2022)
    Konstantin Mishchenko, Francis R. Bach, Mathieu Even, Blake E. Woodworth
 4. AdaScale SGD: A User-Friendly Algorithm for Distributed Training (09 Jul 2020) [ODL]
    Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin
 5. Is Local SGD Better than Minibatch SGD? (18 Feb 2020) [FedML]
    Blake E. Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. B. McMahan, Ohad Shamir, Nathan Srebro
 6. Better Theory for SGD in the Nonconvex World (09 Feb 2020)
    Ahmed Khaled, Peter Richtárik
 7. PyTorch: An Imperative Style, High-Performance Deep Learning Library (03 Dec 2019) [ODL]
    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, ..., Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
 8. RoBERTa: A Robustly Optimized BERT Pretraining Approach (26 Jul 2019) [AIMat]
    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, Veselin Stoyanov
 9. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (01 Apr 2019) [ODL]
    Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, J. Demmel, Kurt Keutzer, Cho-Jui Hsieh
10. Measuring the Effects of Data Parallelism on Neural Network Training (08 Nov 2018)
    Christopher J. Shallue, Jaehoon Lee, J. Antognini, J. Mamou, J. Ketterling, Yao Wang
11. Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron (16 Oct 2018)
    Sharan Vaswani, Francis R. Bach, Mark Schmidt
12. Large Scale Language Modeling: Converging on 40GB of Text in Four Hours (03 Aug 2018)
    Raul Puri, Robert M. Kirby, Nikolai Yakovenko, Bryan Catanzaro
13. Local SGD Converges Fast and Communicates Little (24 May 2018) [FedML]
    Sebastian U. Stich
14. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning (18 Dec 2017)
    Siyuan Ma, Raef Bassily, M. Belkin
15. Don't Decay the Learning Rate, Increase the Batch Size (01 Nov 2017) [ODL]
    Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le
16. ImageNet Large Scale Visual Recognition Challenge (01 Sep 2014) [VLM, ObjD]
    Olga Russakovsky, Jia Deng, Hao Su, J. Krause, S. Satheesh, ..., A. Karpathy, A. Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei
17. One weird trick for parallelizing convolutional neural networks (23 Apr 2014) [GNN]
    A. Krizhevsky
18. A Proximal Stochastic Gradient Method with Progressive Variance Reduction (19 Mar 2014) [ODL]
    Lin Xiao, Tong Zhang
19. Hybrid Deterministic-Stochastic Methods for Data Fitting (13 Apr 2011)
    M. Friedlander, Mark Schmidt