Katyusha: Accelerated Variance Reduction for Faster SGD

Abstract

We consider minimizing $f(x)$ that is an average of $n$ convex, smooth functions $f_i(x)$, and provide the first direct stochastic gradient method $\mathtt{Katyusha}$ that has the accelerated convergence rate. It converges to an $\varepsilon$-approximate minimizer using $O\big((n + \sqrt{n\kappa})\cdot \log\frac{f(x_0)-f(x^*)}{\varepsilon}\big)$ stochastic gradients, where $\kappa$ is the condition number.

$\mathtt{Katyusha}$ is a primal-only method, supporting proximal updates, non-Euclidean norm smoothness, mini-batch sampling, as well as non-uniform sampling. It also resolves the following open questions in machine learning:

• If $f(x)$ is not strongly convex (e.g., Lasso, logistic regression), $\mathtt{Katyusha}$ gives the first stochastic method that achieves the optimal $1/\sqrt{\varepsilon}$ rate.
• If $f(x)$ is strongly convex and each $f_i(x)$ is "rank-one" (e.g., SVM), $\mathtt{Katyusha}$ gives the first stochastic method that achieves the optimal $1/\sqrt{\varepsilon}$ rate.
• If $f(x)$ is not strongly convex and each $f_i(x)$ is "rank-one" (e.g., L1SVM), $\mathtt{Katyusha}$ gives the first stochastic method that achieves the optimal $1/\varepsilon$ rate.

The main ingredient in $\mathtt{Katyusha}$ is a novel "negative momentum on top of momentum" that can be elegantly coupled with the existing variance-reduction trick for stochastic gradient descent. As a result, since variance reduction has been successfully applied to a fast-growing list of practical problems, our paper implies that one had better hurry up and give $\mathtt{Katyusha}$ a hug in each of them, in the hope of a faster running time in practice as well.
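To make the "negative momentum on top of momentum, coupled with variance reduction" idea concrete, below is a minimal sketch of a Katyusha-style loop in Python/NumPy for an unconstrained least-squares instance. The function name `katyusha_sketch`, the particular parameter choices (`tau1`, `tau2`, `alpha`, the inner-loop length), and the plain-average snapshot update are simplifying assumptions of this illustration, not a verbatim transcription of the paper's pseudocode (which additionally handles proximal terms, non-Euclidean norms, mini-batching, and non-uniform sampling).

```python
import numpy as np

def katyusha_sketch(A, b, sigma, L, epochs=20, seed=0):
    """Hedged sketch of an accelerated, variance-reduced SGD loop in the
    Katyusha style, for f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2,
    assumed L-smooth and sigma-strongly convex.

    The parameter settings below are assumptions of this sketch, chosen to
    illustrate the structure of the method rather than to match the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = 2 * n                                    # inner iterations per epoch (assumption)
    tau2 = 0.5                                   # "negative momentum" weight toward the snapshot
    tau1 = min(np.sqrt(n * sigma / (3.0 * L)), 0.5)
    alpha = 1.0 / (3.0 * tau1 * L)

    def grad_i(x, i):                            # gradient of one component f_i
        return (A[i] @ x - b[i]) * A[i]

    def full_grad(x):                            # full gradient of f
        return A.T @ (A @ x - b) / n

    x_tilde = np.zeros(d)                        # snapshot point
    y = np.zeros(d)
    z = np.zeros(d)
    for _ in range(epochs):
        mu = full_grad(x_tilde)                  # full gradient at the snapshot (variance reduction)
        y_sum = np.zeros(d)
        for _ in range(m):
            # three-point coupling: momentum via z, "negative momentum" via x_tilde
            x = tau1 * z + tau2 * x_tilde + (1.0 - tau1 - tau2) * y
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(x_tilde, i) + mu   # SVRG-type variance-reduced estimator
            z = z - alpha * g                    # mirror-descent-style (long) step
            y = x - g / (3.0 * L)                # gradient-descent-style (short) step
            y_sum += y
        x_tilde = y_sum / m                      # new snapshot: plain average (simplification)
    return x_tilde
```

In this sketch, the `tau2 * x_tilde` term is the "negative momentum": it retracts each query point toward the snapshot at which the full gradient was computed, which keeps the variance of the stochastic estimator under control and is what lets Nesterov-style acceleration be combined with variance reduction.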
