Reusing Trajectories in Policy Gradients Enables Fast Convergence

6 June 2025
Alessandro Montenegro
Federico Mansutti
Marco Mussi
Matteo Papini
Alberto Maria Metelli
Community: OnRL
arXiv: 2506.06178
Main: 10 pages · 9 figures · 7 tables · Bibliography: 3 pages · Appendix: 28 pages
Abstract

Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly well suited to continuous control problems. These methods learn the parameters of parametric policies via stochastic gradient ascent, typically using on-policy trajectory data to estimate the policy gradient. However, this reliance on fresh data makes them sample-inefficient: vanilla PG methods require $O(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse off-policy information from past iterations, such as previous gradients or trajectories. While gradient reuse has received substantial theoretical attention, leading to improved rates of $O(\epsilon^{-3/2})$, the reuse of past trajectories remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that extensive reuse of past off-policy trajectories can significantly accelerate convergence in PG methods. We introduce a power mean correction to the multiple importance weighting estimator and propose RPG (Retrospective Policy Gradient), a PG algorithm that combines old and new trajectories in its policy updates. Through a novel analysis, we show that, under established assumptions, RPG achieves a sample complexity of $\widetilde{O}(\epsilon^{-1})$, the best known rate in the literature. We further validate our approach empirically against PG methods with state-of-the-art rates.
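
The abstract describes reusing past off-policy trajectories through a multiple importance weighting (MIW) estimator equipped with a power mean correction. The sketch below is only an illustration of that general idea, not the paper's RPG algorithm: it estimates a REINFORCE-style gradient for a one-parameter linear-Gaussian policy, reweighting trajectories collected under several earlier behaviour policies with a balance-heuristic MIW weight. The denominator smoothing controlled by `lam`, and all helper names, are hypothetical stand-ins; the exact form of the power mean correction and of the RPG update is given in the paper.

```python
# Illustrative sketch: policy-gradient estimation that reuses trajectories
# gathered under past policies via multiple importance weighting (MIW).
# The smoothing below is a placeholder for the paper's power mean correction.
import numpy as np


def gaussian_logpdf(a, mean, std):
    """Elementwise log-density of a Gaussian policy N(mean, std^2)."""
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))


def traj_log_prob(theta, states, actions, std=0.5):
    """Log-probability of a trajectory under a linear-Gaussian policy mean = theta * s."""
    return gaussian_logpdf(actions, theta * states, std).sum()


def miw_policy_gradient(theta_target, behaviour_thetas, trajectories, lam=0.2, std=0.5):
    """
    Estimate the policy gradient at `theta_target` by reusing every stored
    trajectory, instead of discarding off-policy data after each update.

    `trajectories` is a list of (states, actions, return) tuples collected under
    the behaviour parameters in `behaviour_thetas` (assumed equally represented).
    """
    n_behaviours = len(behaviour_thetas)
    grads = []
    for states, actions, ret in trajectories:
        log_p_target = traj_log_prob(theta_target, states, actions, std)
        # Balance-heuristic mixture density over all behaviour policies
        # (uniform mixture coefficients).
        log_p_mix = np.logaddexp.reduce(
            [traj_log_prob(th, states, actions, std) for th in behaviour_thetas]
        ) - np.log(n_behaviours)
        # Hypothetical denominator smoothing standing in for the paper's
        # power mean correction; lam -> 0 recovers the plain MIW weight.
        log_denom = np.logaddexp(np.log1p(-lam) + log_p_mix, np.log(lam) + log_p_target)
        weight = np.exp(log_p_target - log_denom)
        # Score of the trajectory under the target policy:
        # d/dtheta sum_t log N(a_t | theta * s_t, std^2).
        score = ((actions - theta_target * states) * states / std ** 2).sum()
        grads.append(weight * score * ret)
    return np.mean(grads)


# Toy usage: one-step "trajectories" with s ~ N(0, 1), a ~ pi_theta(. | s),
# return r = -(a - s)^2, so the best linear-Gaussian mean is a = s (theta = 1).
rng = np.random.default_rng(0)
behaviour_thetas = [0.0, 0.3, 0.6]
trajectories = []
for th in behaviour_thetas:
    for _ in range(50):
        s = rng.normal(size=5)
        a = th * s + 0.5 * rng.normal(size=5)
        trajectories.append((s, a, -np.sum((a - s) ** 2)))
print(miw_policy_gradient(0.6, behaviour_thetas, trajectories))
```

Reusing all stored trajectories in this way keeps the estimator's sample budget fixed per update; the weights account for the mismatch between the target policy and the mixture of behaviour policies, which is the mechanism the paper analyzes to obtain its improved sample complexity.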
