
On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method

Neural Information Processing Systems (NeurIPS), 2021
17 February 2021
Junyu Zhang
Chengzhuo Ni
Zheng Yu
Csaba Szepesvári
Mengdi Wang
Abstract

Policy gradient (PG) gives rise to a rich class of reinforcement learning (RL) methods. Recently, there has been an emerging trend to accelerate existing PG methods such as REINFORCE by \emph{variance reduction} techniques. However, all existing variance-reduced PG methods rely heavily on an uncheckable importance weight assumption made for every single iteration of the algorithms. In this paper, a simple gradient truncation mechanism is proposed to address this issue. Moreover, we design a Truncated Stochastic Incremental Variance-Reduced Policy Gradient (TSIVR-PG) method, which is able to maximize not only a cumulative sum of rewards but also a general utility function over a policy's long-term visiting distribution. We show an $\tilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity for TSIVR-PG to find an $\epsilon$-stationary policy. By assuming overparameterization of the policy and exploiting the hidden convexity of the problem, we further show that TSIVR-PG converges to a globally $\epsilon$-optimal policy with $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples.
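
The sketch below illustrates the general recipe the abstract describes: a recursive (SARAH/SPIDER-style) variance-reduced policy gradient estimator whose off-policy correction uses trajectory importance weights, combined with a truncation step that bounds the policy update. It is a minimal illustration only: the toy MDP, softmax tabular parameterization, hyperparameters, norm-based step truncation, and importance-weight clipping are assumptions for readability, not the paper's exact TSIVR-PG algorithm or its general-utility objective.

```python
# Minimal sketch of variance-reduced policy gradient with a truncated update,
# in the spirit of (but not identical to) the TSIVR-PG idea in the abstract.
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy 2-state, 2-action MDP for illustration.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a, s']
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])     # R[s, a]
GAMMA, HORIZON = 0.95, 20

def policy(theta, s):
    """Softmax action probabilities for state s under tabular parameters theta."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def sample_trajectory(theta):
    s, traj = 0, []
    for _ in range(HORIZON):
        a = rng.choice(2, p=policy(theta, s))
        traj.append((s, a))
        s = rng.choice(2, p=P[s, a])
    return traj

def reinforce_grad(theta, traj):
    """REINFORCE gradient of the discounted return for a single trajectory."""
    g = np.zeros_like(theta)
    ret = sum(GAMMA**t * R[s, a] for t, (s, a) in enumerate(traj))
    for s, a in traj:
        g[s] -= policy(theta, s)   # d log pi / d theta[s, :]
        g[s, a] += 1.0
    return g * ret

def importance_weight(theta_new, theta_old, traj, clip=10.0):
    """Trajectory likelihood ratio pi_old / pi_new, clipped for stability
    (a stand-in for the paper's truncation mechanism)."""
    log_w = sum(np.log(policy(theta_old, s)[a]) - np.log(policy(theta_new, s)[a])
                for s, a in traj)
    return min(np.exp(log_w), clip)

def vr_step(theta, theta_old, v_old, batch_size=5, eta=0.05, delta=0.1):
    """One recursive variance-reduced step followed by a truncated update."""
    trajs = [sample_trajectory(theta) for _ in range(batch_size)]
    # Recursive correction: new gradient minus importance-weighted old gradient
    # on the same trajectories, added to the previous estimate.
    corr = np.mean([reinforce_grad(theta, tr)
                    - importance_weight(theta, theta_old, tr) * reinforce_grad(theta_old, tr)
                    for tr in trajs], axis=0)
    v = v_old + corr
    step = eta * v
    norm = np.linalg.norm(step)
    if norm > delta:               # truncate the update so the policy moves only a bounded amount
        step *= delta / norm
    return theta + step, v

theta = np.zeros((2, 2))
v = np.mean([reinforce_grad(theta, sample_trajectory(theta)) for _ in range(20)], axis=0)
theta_old = theta.copy()
for _ in range(50):
    theta_new, v = vr_step(theta, theta_old, v)
    theta_old, theta = theta, theta_new
print("final policy in state 0:", policy(theta, 0))
```

Bounding the size of each policy update is what keeps the trajectory importance weights controlled, which is the role the abstract attributes to gradient truncation in place of the uncheckable importance weight assumption of earlier variance-reduced PG methods.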
