arXiv:2302.06570
Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD

13 February 2023
Matthew Faw
Litu Rout
C. Caramanis
Sanjay Shakkottai
Abstract

This work considers the problem of finding a first-order stationary point of a non-convex function with a potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions capture practical machine learning problems more closely than the pervasive $L_0$-smoothness assumption. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$, which is $(0,\mathcal{O}(L_1))$-smooth. Despite this richness, an emerging line of works achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $L_0$-smooth setting, and in many practical settings it is either not satisfied or leads to convergence rates with a weaker dependence on the noise scale. We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for $(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time $\tau$ which is simultaneously "large" on average, yet also allows us to treat the adaptive step sizes before $\tau$ as (roughly) independent of the gradients. For general $(L_0,L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter $\sigma_1 < 1$. For a broad subclass of $(L_0,L_1)$-smooth functions, our convergence rate continues to hold when $\sigma_1 \geq 1$. By contrast, we prove that many algorithms analyzed by prior works on $(L_0,L_1)$-smooth optimization diverge with constant probability, even for smooth and strongly-convex functions, when $\sigma_1 > 1$.
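
For reference, a minimal statement of the smoothness condition discussed above, in the twice-differentiable form of Zhang et al. (ICLR'20); the paper itself may work with a relaxed, gradient-difference variant of this condition:

% (L_0, L_1)-smoothness, twice-differentiable form (Zhang et al., ICLR'20):
\[
  \|\nabla^2 f(x)\| \;\le\; L_0 + L_1\,\|\nabla f(x)\| \quad \text{for all } x.
\]
% Worked example: f(x) = exp(L_1 x) has f''(x) = L_1^2 exp(L_1 x) = L_1 |f'(x)|,
% so it is (0, L_1)-smooth even though its second derivative admits no uniform bound L_0.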

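The abstract refers to adaptive step sizes treated via the stopping time $\tau$; below is a minimal sketch of an AdaGrad-Norm-style update, the kind of adaptive SGD commonly studied in this line of work. The function names and the toy noisy quadratic oracle are illustrative assumptions, not the paper's experimental setup.

import numpy as np

def adagrad_norm_sgd(grad_oracle, x0, eta=1.0, b0=1.0, T=1000):
    """Sketch of AdaGrad-Norm SGD (assumed instantiation of adaptive SGD):
    the step size shrinks with the cumulative squared norm of the
    stochastic gradients observed so far."""
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2                          # running sum b_t^2 = b_0^2 + sum_s ||g_s||^2
    for _ in range(T):
        g = grad_oracle(x)                  # stochastic gradient at the current iterate
        b_sq += np.dot(g, g)                # accumulate squared gradient norm
        x = x - (eta / np.sqrt(b_sq)) * g   # adaptive step size eta / b_t
    return x

# Toy usage: noisy gradient oracle for f(x) = 0.5 * ||x||^2 with Gaussian noise.
rng = np.random.default_rng(0)
oracle = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_final = adagrad_norm_sgd(oracle, x0=np.ones(5))

The adaptive denominator is what couples the step size to past gradients; the paper's stopped analysis is designed to break (approximately) this coupling before the stopping time $\tau$.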