
A Simple Convergence Proof of Adam and Adagrad

Abstract

We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound which is explicit in the constants of the problem, the parameters of the optimizer, and the total number of iterations $N$. This bound can be made arbitrarily small: Adam with a learning rate $\alpha = 1/\sqrt{N}$ and a momentum parameter on squared gradients $\beta_2 = 1 - 1/N$ achieves the same rate of convergence $O(\ln(N)/\sqrt{N})$ as Adagrad. Finally, we obtain the tightest dependency on the heavy-ball momentum among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$. Our technique also improves the best known dependency for standard SGD by a factor $1 - \beta_1$.
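To make the parameter setting concrete, the following is a minimal NumPy sketch of the plain Adam update (bias correction omitted) run with the choices highlighted in the abstract, $\alpha = 1/\sqrt{N}$ and $\beta_2 = 1 - 1/N$. The test objective, function names, and all hyperparameter defaults other than these two are illustrative assumptions, not part of the paper.

```python
import numpy as np

def adam(grad, x0, n_iters, beta1=0.9, eps=1e-8):
    """Plain Adam (no bias correction), with the abstract's schedule:
    learning rate alpha = 1/sqrt(N) and beta2 = 1 - 1/N for N iterations."""
    N = n_iters
    alpha = 1.0 / np.sqrt(N)          # learning rate alpha = 1/sqrt(N)
    beta2 = 1.0 - 1.0 / N             # momentum on squared gradients beta2 = 1 - 1/N
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)              # first moment (heavy-ball momentum)
    v = np.zeros_like(x)              # second moment (squared gradients)
    for _ in range(N):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        x -= alpha * m / (np.sqrt(v) + eps)
    return x

# Hypothetical smooth non-convex objective with bounded gradients:
# f(x) = sum(x^2 / (1 + x^2)), whose gradient is 2x / (1 + x^2)^2.
grad_f = lambda x: 2.0 * x / (1.0 + x**2) ** 2
x_final = adam(grad_f, x0=3.0 * np.ones(10), n_iters=10_000)
print(np.linalg.norm(grad_f(x_final)))  # gradient norm shrinks as N grows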
