arXiv:2008.10898

PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization

25 August 2020
Zhize Li
Hongyan Bao
Xiangliang Zhang
Peter Richtárik
Abstract

In this paper, we propose a novel stochastic gradient estimator, the ProbAbilistic Gradient Estimator (PAGE), for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$, or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p_t$. We give a simple formula for the optimal choice of $p_t$. Moreover, we prove the first tight lower bound $\Omega(n+\frac{\sqrt{n}}{\epsilon^2})$ for nonconvex finite-sum problems, which also leads to a tight lower bound $\Omega(b+\frac{\sqrt{b}}{\epsilon^2})$ for nonconvex online problems, where $b := \min\{\frac{\sigma^2}{\epsilon^2}, n\}$. Then, we show that PAGE obtains the optimal convergence results $O(n+\frac{\sqrt{n}}{\epsilon^2})$ (finite-sum) and $O(b+\frac{\sqrt{b}}{\epsilon^2})$ (online), matching our lower bounds for both nonconvex finite-sum and online problems. Besides, we also show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE can automatically switch to a faster linear convergence rate $O(\cdot\log\frac{1}{\epsilon})$. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch showing that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.
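The two-branch update described in the abstract can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' PyTorch implementation. The toy loss, the function names (`grad_i`, `minibatch_grad`, `page_step`, `run_page`), the batch sizes `b` and `b_small`, and the step size are all hypothetical. The abstract does not state the formula for the probability, so the constant choice `p = b_small / (b + b_small)` used below is an assumption.

```python
import random


def grad_i(x, a, y):
    """Gradient of the toy nonconvex loss r^2 / (1 + r^2), with r = a*x - y."""
    r = a * x - y
    return 2.0 * r * a / (1.0 + r * r) ** 2


def minibatch_grad(x, data, idx):
    """Average of the per-sample gradients over the indices in idx."""
    return sum(grad_i(x, *data[i]) for i in idx) / len(idx)


def page_step(x_new, x_old, g_old, data, b, b_small, p, rng):
    """One PAGE-style estimator update (scalar parameter for brevity)."""
    n = len(data)
    if rng.random() < p:
        # With probability p: plain minibatch SGD gradient at the new point.
        idx = rng.sample(range(n), b)
        return minibatch_grad(x_new, data, idx)
    # With probability 1 - p: reuse the previous estimate, adjusted by a
    # cheaper gradient difference computed on a smaller batch.
    idx = rng.sample(range(n), b_small)
    return g_old + minibatch_grad(x_new, data, idx) - minibatch_grad(x_old, data, idx)


def run_page(data, x0=0.0, lr=0.1, b=16, b_small=4, steps=200, seed=0):
    """Run gradient steps x <- x - lr * g using the PAGE-style estimator g."""
    rng = random.Random(seed)
    p = b_small / (b + b_small)  # assumed choice; not given in the abstract
    x = x0
    g = minibatch_grad(x, data, rng.sample(range(len(data)), b))
    for _ in range(steps):
        x_next = x - lr * g
        g = page_step(x_next, x, g, data, b, b_small, p, rng)
        x = x_next
    return x
```

Calling `run_page(data)` on a list of `(a, y)` pairs with at least `b` entries returns the final iterate. Setting `p = 1` in `page_step` gives ordinary minibatch SGD, which shows how small the adjustment to vanilla SGD is.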
