Better Communication Complexity for Local SGD

International Conference on Artificial Intelligence and Statistics (AISTATS), 2019
Abstract

We revisit the local Stochastic Gradient Descent (local SGD) method and prove new convergence rates. We close a gap in the theory by showing that it works under unbounded gradients, and we extend its convergence to weakly convex functions. Furthermore, by changing the assumptions, we obtain new bounds that explain in which regimes local SGD is faster than its non-local counterpart. For instance, if the objective is strongly convex, we show that, up to constants, it is sufficient to synchronize $M$ times in total, where $M$ is the number of nodes. This improves upon the known requirement of Stich (2018) of $\sqrt{TM}$ synchronizations in total, where $T$ is the total number of iterations, which helps to explain the empirical success of local SGD.
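To make the communication pattern concrete, the following is a minimal illustrative sketch of local SGD (not the paper's exact algorithm or step-size schedule): M workers each run SGD locally on a toy strongly convex objective, and every H iterations a communication round averages their iterates. The objective, noise model, and all parameter values here are assumptions chosen for illustration.

```python
import random

def local_sgd(grad, x0, M, T, H, lr, rng):
    """Illustrative local SGD: M workers take local SGD steps and
    average ("synchronize") their iterates every H iterations."""
    xs = [x0 for _ in range(M)]
    for t in range(1, T + 1):
        # each worker takes one local step with an independently noisy gradient
        xs = [x - lr * (grad(x) + 0.01 * rng.gauss(0, 1)) for x in xs]
        if t % H == 0:
            # communication round: replace every local iterate by the average
            avg = sum(xs) / M
            xs = [avg] * M
    return sum(xs) / M

rng = random.Random(0)
# toy strongly convex objective f(x) = x^2 / 2, so grad(x) = x
x_final = local_sgd(lambda x: x, x0=1.0, M=4, T=200, H=10, lr=0.1, rng=rng)
```

With T = 200 iterations and H = 10, the workers communicate only 20 times; the abstract's claim is that, for strongly convex objectives, the number of such communication rounds can be driven down to the order of M without hurting the rate.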
