arXiv:1602.04951

Q(λ) with Off-Policy Corrections

16 February 2016
Anna Harutyunyan
Marc G. Bellemare
T. Stepleton
Rémi Munos
Abstract

We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter and the discount factor, and formalize an underlying tradeoff in off-policy TD(λ). We illustrate this theoretical relationship empirically on a continuous-state control task.
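
As a rough illustration of the idea in the abstract, the sketch below shows one possible tabular reading of a Q-function-corrected multi-step update: each TD error bootstraps from the expected value of the current Q-function under the target policy, and no importance-sampling ratio is applied to the trace. This is a reader's sketch, not the authors' code; the function name `q_lambda_episode`, its arguments, and the default step size are hypothetical choices made for illustration.

```python
import numpy as np

def q_lambda_episode(Q, episode, pi, gamma=0.9, lam=0.5, alpha=0.1):
    """One pass of a tabular, Q-corrected multi-step update (illustrative only).

    Q       : array [n_states, n_actions], current action-value estimates
    episode : list of (state, action, reward, next_state, done) tuples,
              assumed to be generated by some behavior policy
    pi      : array [n_states, n_actions], target-policy probabilities
    """
    Q = Q.copy()                     # leave the caller's table untouched
    trace = np.zeros_like(Q)         # eligibility trace over (state, action)
    for s, a, r, s_next, done in episode:
        # Correction term: expected next-state value of the current Q
        # under the target policy (zero at terminal transitions).
        v_next = 0.0 if done else float(pi[s_next] @ Q[s_next])
        delta = r + gamma * v_next - Q[s, a]
        # Decay all traces by gamma * lambda, with no importance weights,
        # then mark the visited pair.
        trace *= gamma * lam
        trace[s, a] += 1.0
        Q += alpha * delta * trace
    return Q
```

The trace decays by `gamma * lam` regardless of how far the behavior policy is from the target policy, which is why the abstract's conditions tie the admissible trace parameter to the policy distance and the discount factor.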
