Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

Abstract

Adaptive gradient methods are arguably the most successful optimization algorithms for neural network training. While it is well known that adaptive gradient methods can achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In this paper, we aim to close this gap by analyzing the convergence rates of AdaGrad measured by the $\ell_1$-norm of the gradient. Specifically, when the objective has an $L$-Lipschitz gradient and the stochastic gradient variance is bounded by $\sigma^2$, we prove a worst-case convergence rate of $\tilde{\mathcal{O}}\big(\frac{\sqrt{d}L}{\sqrt{T}} + \frac{\sqrt{d}\sigma}{T^{1/4}}\big)$, where $d$ is the dimension of the problem. We also present a lower bound of $\Omega\big(\frac{\sqrt{d}}{\sqrt{T}}\big)$ for minimizing the gradient $\ell_1$-norm in the deterministic setting, showing the tightness of our upper bound in the noiseless case. Moreover, under more fine-grained assumptions on the smoothness structure of the objective and the gradient noise, and under favorable gradient $\ell_1/\ell_2$ geometry, we show that AdaGrad can potentially shave a factor of $\sqrt{d}$ compared to SGD. To the best of our knowledge, this is the first result for adaptive gradient methods that demonstrates a provable gain over SGD in the non-convex setting.
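
Since the analysis centers on AdaGrad's iterates measured in the gradient $\ell_1$-norm, the following is a minimal sketch of the standard coordinate-wise AdaGrad update with that metric tracked per step. It is not the paper's code: the step size `eta`, stabilizer `eps`, and the noisy-quadratic test objective are illustrative assumptions.

```python
import numpy as np

def adagrad(grad_fn, x0, T=1000, eta=0.1, eps=1e-8, seed=0):
    """Coordinate-wise AdaGrad on a stochastic objective.

    grad_fn(x, rng) must return a stochastic gradient at x.
    Returns the iterate with the smallest observed gradient
    l1-norm, matching the convergence measure in the abstract.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    accum = np.zeros_like(x)           # running sum of squared gradients
    best_x, best_l1 = x.copy(), np.inf
    for _ in range(T):
        g = grad_fn(x, rng)
        accum += g * g                 # per-coordinate second-moment accumulator
        x = x - eta * g / (np.sqrt(accum) + eps)
        l1 = np.abs(g).sum()           # gradient l1-norm, the paper's metric
        if l1 < best_l1:
            best_l1, best_x = l1, x.copy()
    return best_x, best_l1

# Illustrative usage: quadratic f(x) = 0.5 * ||x||^2 with additive
# Gaussian gradient noise of scale sigma (a stand-in for the bounded-
# variance assumption, not an example from the paper).
sigma = 0.1
noisy_quadratic_grad = lambda x, rng: x + sigma * rng.standard_normal(x.shape)
x_best, l1_best = adagrad(noisy_quadratic_grad, x0=np.ones(50))
print(f"best observed gradient l1-norm: {l1_best:.4f}")
```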
