
Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization

Abstract

Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work favorably in solving large-scale learning problems. Despite this remarkable success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with the data size and thus could still be expensive for very large datasets. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly convex problems that enjoys provably improved, data-size-independent complexity guarantees. More precisely, for a quadratic loss $F(\theta)$ of $n$ components, we prove that HSDMPG can attain an $\epsilon$-optimization-error $\mathbb{E}[F(\theta)-F(\theta^*)]\leq\epsilon$ within $\mathcal{O}\Big(\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(\frac{1}{\epsilon})+1}{\epsilon}\wedge\Big(\kappa \sqrt{n}\log^{1.5}\big(\frac{1}{\epsilon}\big)+n\log\big(\frac{1}{\epsilon}\big)\Big)\Big)$ stochastic gradient evaluations, where $\kappa$ is the condition number. For generic strongly convex loss functions, we prove a nearly identical complexity bound, though at the cost of slightly increased logarithmic factors. For large-scale learning problems, our complexity bounds are superior to those of prior state-of-the-art SVRG algorithms with or without dependence on data size. In particular, in the case of $\epsilon=\mathcal{O}\big(1/\sqrt{n}\big)$, which is of the order of the intrinsic excess error bound of a learning model and thus sufficient for generalization, the stochastic gradient complexity bounds of HSDMPG for quadratic and generic loss functions are $\mathcal{O}(n^{0.875}\log^{1.5}(n))$ and $\mathcal{O}(n^{0.875}\log^{2.25}(n))$, respectively, which to the best of our knowledge achieve, for the first time, optimal generalization in less than a single pass over the data. Extensive numerical results demonstrate the computational advantages of our algorithm over prior algorithms.
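As a rough sanity check on the stated rates, the worked substitution below sketches how the $n^{0.875}$ figure can arise from the quadratic-loss bound when $\epsilon=\Theta(1/\sqrt{n})$. It additionally assumes $\kappa=\mathcal{O}(\sqrt{n})$ (as would follow, e.g., from an $\ell_2$-regularization level on the order of $1/\sqrt{n}$); this scaling is an assumption of the sketch, not a statement taken from the abstract.

% Sketch only: assumes $\epsilon=\Theta(1/\sqrt{n})$ and $\kappa=\mathcal{O}(\sqrt{n})$; not quoted from the paper.
\begin{align*}
\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}\big(\tfrac{1}{\epsilon}\big)+1}{\epsilon}
&= \kappa^{1.5}\,\epsilon^{-0.25}\log^{1.5}\big(\tfrac{1}{\epsilon}\big)+\epsilon^{-1} \\
&= \mathcal{O}\big(n^{0.75}\cdot n^{0.125}\log^{1.5}(n)+n^{0.5}\big)
 = \mathcal{O}\big(n^{0.875}\log^{1.5}(n)\big),
\end{align*}

whereas the data-size-dependent branch $\kappa\sqrt{n}\log^{1.5}\big(\tfrac{1}{\epsilon}\big)+n\log\big(\tfrac{1}{\epsilon}\big)=\mathcal{O}\big(n\log^{1.5}(n)\big)$ already exceeds $n$ under these assumptions. The minimum selected by $\wedge$ is therefore the first branch, which is $o(n)$, i.e., less than a single pass over the data.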
