
arXiv: 2210.15755

Confident Approximate Policy Iteration for Efficient Local Planning in $q^\pi$-realizable MDPs

27 October 2022
Gellert Weisz
András Gyorgy
Tadashi Kozuno
Csaba Szepesvári
Abstract

We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $\epsilon$ of the action-value functions of stationary policies. This improvement over API (whose error scales with $H^2$) comes at the price of an $H$-fold increase in memory cost. Unlike Scherrer and Lesner [2012], who recommended computing a non-stationary policy to achieve a similar improvement (with the same memory overhead), we are able to stick to stationary policies. This allows for our second contribution, the application of CAPI to planning with local access to a simulator and $d$-dimensional linear function approximation. As such, we design a planning algorithm that applies CAPI to obtain a sequence of policies with successively refined accuracies on a dynamically evolving set of states. The algorithm outputs an $\tilde O(\sqrt{d}H\epsilon)$-optimal policy after issuing $\tilde O(dH^4/\epsilon^2)$ queries to the simulator, simultaneously achieving the optimal accuracy bound and the best known query complexity bound, while earlier algorithms in the literature achieve only one of them. This query complexity is shown to be tight in all parameters except $H$. These improvements come at the expense of a mild (polynomial) increase in memory and computational costs of both the algorithm and its output policy.
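To get a feel for how the rates quoted in the abstract compare, the following is a minimal sketch that evaluates only the leading terms of each bound. It assumes the common convention $H = 1/(1-\gamma)$ for the effective horizon (the abstract does not state the exact definition) and drops all constants and logarithmic factors hidden by $\tilde O$. The function names `effective_horizon` and `scaling_terms` are hypothetical and introduced here purely for illustration; this is not the paper's algorithm.

```python
import math


def effective_horizon(gamma: float) -> float:
    """Effective horizon under the ASSUMED convention H = 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


def scaling_terms(gamma: float, eps: float, d: int) -> dict:
    """Leading terms of the rates quoted in the abstract.

    Constants and log factors are omitted, so the values are only
    meaningful relative to one another, not as actual guarantees.
    """
    if eps <= 0.0 or d <= 0:
        raise ValueError("eps and d must be positive")
    H = effective_horizon(gamma)
    return {
        "H": H,
        "api_error": H**2 * eps,                  # classical API: H^2 * eps
        "capi_error": H * eps,                    # CAPI: H * eps
        "planner_error": math.sqrt(d) * H * eps,  # planner: sqrt(d) * H * eps
        "planner_queries": d * H**4 / eps**2,     # queries: d * H^4 / eps^2
    }


terms = scaling_terms(gamma=0.9, eps=0.01, d=16)
for name, value in terms.items():
    print(f"{name}: {value:.6g}")
```

Under these assumptions, the ratio of the API term to the CAPI term equals $H$, which is the improvement the abstract describes; the query count grows with the fourth power of $H$, the one parameter in which tightness is not established.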
