
Last-Iterate Convergence of General Parameterized Policies in Constrained MDPs

Abstract

We consider the problem of learning a Constrained Markov Decision Process (CMDP) via general parameterization. Our proposed Primal-Dual-based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm uses entropy and quadratic regularizers to reach this goal. For a parameterized policy class with transferred compatibility approximation error $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation (up to an additive factor of $\epsilon_{\mathrm{bias}}$) with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policy classes with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. This is a significant improvement over the state-of-the-art last-iterate guarantees for general parameterized CMDPs.
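To make the primal-dual setup concrete, the sketch below shows a generic entropy- and quadratically-regularized Lagrangian of the kind the abstract alludes to; the symbols $J_r$, $J_c$ (reward and constraint value functions), the dual variable $\lambda$, and the weights $\zeta,\eta>0$ are illustrative placeholders and not the paper's exact formulation:
\[
\mathcal{L}_{\zeta,\eta}(\theta,\lambda)
\;=\; J_r(\theta) \;+\; \lambda\, J_c(\theta)
\;+\; \zeta\, \mathcal{H}(\pi_\theta)
\;-\; \frac{\eta}{2}\,\lambda^{2},
\qquad \lambda \ge 0,
\]
where $\mathcal{H}(\pi_\theta)$ denotes the entropy of the parameterized policy $\pi_\theta$. In such schemes, the primal player ascends in $\theta$ (here via accelerated natural policy gradient steps) while the dual player descends in $\lambda$ with projection onto $\lambda \ge 0$; the two regularizers make the saddle-point problem better conditioned, a common device for obtaining last-iterate (rather than averaged-iterate) guarantees.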

