Katyusha: Accelerated Variance Reduction for Faster SGD

We consider minimizing $f(x)$ that is an average of $n$ convex, smooth functions $f_i(x)$, and provide the first direct stochastic gradient method, Katyusha, that has the accelerated convergence rate. It converges to an $\varepsilon$-approximate minimizer using $O\big((n+\sqrt{n\kappa})\log(1/\varepsilon)\big)$ stochastic gradients, where $\kappa$ is the condition number.

Katyusha is a primal-only method, supporting proximal updates, non-Euclidean norm smoothness, mini-batch sampling, as well as non-uniform sampling. It also resolves the following open questions in machine learning:

- If $f(x)$ is not strongly convex (e.g., Lasso, logistic regression), Katyusha gives the first stochastic method that achieves the optimal rate.
- If $f(x)$ is strongly convex and each $f_i(x)$ is "rank-one" (e.g., SVM), Katyusha gives the first stochastic method that achieves the optimal rate.
- If $f(x)$ is not strongly convex and each $f_i(x)$ is "rank-one" (e.g., L1SVM), Katyusha gives the first stochastic method that achieves the optimal rate.

The main ingredient in Katyusha is a novel "negative momentum on top of momentum" that can be elegantly coupled with the existing variance-reduction trick for stochastic gradient descent. As a result, since variance reduction has been successfully applied to a fast-growing list of practical problems, our paper suggests that one had better hurry up and give each of them a Katyusha hug, in the hope of a faster running time in practice as well.
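To make the "negative momentum on top of momentum" coupling concrete, here is a minimal Python sketch of a Katyusha-style accelerated SVRG loop. The function name `katyusha_sketch`, the uniform snapshot average, and the particular step-size and momentum constants are illustrative assumptions (they follow one common presentation and are not tuned); the key point is that each iterate mixes an ordinary momentum sequence with a term that pulls back toward the variance-reduction snapshot.

```python
import numpy as np

def katyusha_sketch(grad_i, n, d, L, sigma, epochs=20, x0=None, seed=0):
    """Minimal sketch of a Katyusha-style accelerated SVRG loop.

    grad_i(x, i) returns the gradient of the i-th component f_i at x.
    L is a smoothness constant and sigma > 0 a strong-convexity constant;
    the constants below are illustrative, not the paper's exact settings.
    """
    rng = np.random.default_rng(seed)
    m = 2 * n                                          # inner-epoch length
    tau1 = min(float(np.sqrt(m * sigma / (3 * L))), 0.5)  # classical momentum weight
    tau2 = 0.5                                         # "negative momentum" weight toward the snapshot
    alpha = 1.0 / (3 * tau1 * L)                       # step size for the z-sequence

    x_tilde = np.zeros(d) if x0 is None else x0.copy() # snapshot point
    y = x_tilde.copy()
    z = x_tilde.copy()

    for _ in range(epochs):
        # Full gradient at the snapshot: the variance-reduction anchor.
        mu = np.mean([grad_i(x_tilde, i) for i in range(n)], axis=0)
        y_sum = np.zeros(d)
        for _ in range(m):
            # Coupling: momentum via z, negative momentum via x_tilde.
            x = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
            i = rng.integers(n)
            # Variance-reduced stochastic gradient estimator.
            g = mu + grad_i(x, i) - grad_i(x_tilde, i)
            z = z - alpha * g          # mirror-descent-like step
            y = x - g / (3 * L)        # gradient-descent-like step
            y_sum += y
        x_tilde = y_sum / m            # new snapshot (plain average for simplicity)
    return x_tilde
```

As a hypothetical usage, for ridge regression with rows `a[i]` and labels `b[i]` one could pass `grad_i = lambda x, i: (a[i] @ x - b[i]) * a[i] + lam * x`, with `L` roughly `max_i ||a[i]||**2 + lam` and `sigma = lam`; these constants are only rough choices for illustration.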
View on arXiv