arXiv:1811.09013

An Off-policy Policy Gradient Theorem Using Emphatic Weightings

22 November 2018
Ehsan Imani
Eric Graves
Martha White
    OffRL
Abstract

Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm, called Actor Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods, particularly OffPAC and DPG, converge to the wrong solution whereas ACE finds the optimal solution.
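For concreteness, here is a sketch of the form of the result the abstract describes, using standard notation from the emphatic temporal-difference literature; the symbols d_mu (stationary state distribution of the behaviour policy), i(s) (an interest weighting over states), and P_{pi,gamma} (the discounted state-to-state transition matrix under the target policy) are notational assumptions, not quotations from the paper. Under the excursion objective J_mu(theta) = sum_s d_mu(s) i(s) v_{pi_theta}(s), the theorem gives the gradient an on-policy-like form, with the state weighting supplied by the emphatic weighting m rather than by d_mu alone:

\[
\nabla_\theta J_\mu(\theta) = \sum_s m(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, q_{\pi_\theta}(s,a),
\qquad
\mathbf{m}^{\top} = \mathbf{d}_{\mu,i}^{\top}\left(\mathbf{I} - \mathbf{P}_{\pi,\gamma}\right)^{-1},
\]

where d_{mu,i}(s) = d_mu(s) i(s). Intuitively, m(s) credits a state both for being visited under the behaviour policy and for being reachable, under the target policy, from states the behaviour policy visits.

The ACE actor update can then be viewed as an incremental approximation of this gradient using a follow-on trace, in the style of emphatic TD. The code below is a minimal, hypothetical sketch (plain NumPy, invented function and argument names); the exact trace and step-size details in the paper may differ.

    import numpy as np

    def ace_actor_step(theta, grad_log_pi, delta, rho, rho_prev, F_prev,
                       interest=1.0, gamma=0.99, lambda_a=1.0, alpha=1e-3):
        """One ACE-style actor step (sketch, not the paper's exact pseudocode).

        theta         -- actor parameters
        grad_log_pi   -- gradient of log pi(A_t | S_t; theta) w.r.t. theta
        delta         -- TD error from the critic: R_{t+1} + gamma*v(S_{t+1}) - v(S_t)
        rho, rho_prev -- importance-sampling ratios pi(A|S)/mu(A|S) at steps t, t-1
        F_prev        -- follow-on trace carried over from the previous step
        """
        # Follow-on trace: discounted, importance-corrected accumulation of interest.
        F = gamma * rho_prev * F_prev + interest
        # Emphasis: lambda_a = 0 falls back to an OffPAC-style weighting,
        # lambda_a = 1 applies the full emphatic correction from the trace.
        M = (1.0 - lambda_a) * interest + lambda_a * F
        # Semi-gradient actor update, emphasised by M and corrected by rho.
        theta = theta + alpha * rho * M * delta * np.asarray(grad_log_pi)
        return theta, F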
