A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP

13 July 2022
Fan Chen
Junyu Zhang
Zaiwen Wen
    OffRL
arXiv:2207.06147 (PDF, HTML)
Abstract

As an important framework for safe Reinforcement Learning, the Constrained Markov Decision Process (CMDP) has been extensively studied in the recent literature. However, despite the rich results under various on-policy learning settings, essential understanding of offline CMDP problems is still lacking, in terms of both algorithm design and the information-theoretic sample complexity lower bound. In this paper, we focus on solving CMDP problems where only offline data are available. By adopting the concept of the single-policy concentrability coefficient $C^*$, we establish an $\Omega\left(\frac{\min\{|\mathcal{S}||\mathcal{A}|,\,|\mathcal{S}|+I\}\,C^*}{(1-\gamma)^3\epsilon^2}\right)$ sample complexity lower bound for the offline CMDP problem, where $I$ stands for the number of constraints. By introducing a simple but novel deviation control mechanism, we propose a near-optimal primal-dual learning algorithm called DPDL. This algorithm provably guarantees zero constraint violation, and its sample complexity matches the above lower bound except for an $\tilde{\mathcal{O}}((1-\gamma)^{-1})$ factor. A comprehensive discussion of how to handle the unknown constant $C^*$ and the potential asynchronous structure of the offline dataset is also included.
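
To make the primal-dual template concrete, below is a minimal sketch of the classic Lagrangian saddle-point scheme for a tabular CMDP with a single constraint. This is not the paper's DPDL algorithm (which learns from offline data and adds a deviation control mechanism); it only illustrates the primal best-response / dual subgradient-ascent structure that such methods refine. The toy CMDP, step size, and iteration counts are all illustrative assumptions.

    # Generic Lagrangian primal-dual sketch for a tabular CMDP (NOT the paper's DPDL).
    # Toy model and hyperparameters below are illustrative assumptions only.
    import numpy as np

    rng = np.random.default_rng(0)

    S, A, gamma = 4, 2, 0.9                      # |S| states, |A| actions, discount
    P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
    r = rng.uniform(size=(S, A))                 # reward r(s, a)
    c = rng.uniform(size=(S, A))                 # constraint cost c(s, a)
    b = 0.5 / (1.0 - gamma)                      # budget: discounted cost must stay <= b
    rho = np.ones(S) / S                         # initial-state distribution

    def q_values(reward):
        """Value iteration for the unconstrained MDP with the given reward table."""
        V = np.zeros(S)
        Q = np.zeros((S, A))
        for _ in range(500):
            Q = reward + gamma * P @ V           # Q[s, a] = reward + gamma * E[V(s')]
            V = Q.max(axis=1)
        return Q

    def policy_value(pi, table):
        """Exact discounted value of a deterministic policy pi under a reward/cost table."""
        P_pi = P[np.arange(S), pi]               # transition matrix induced by pi
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, table[np.arange(S), pi])
        return rho @ v

    # Dual (projected subgradient) ascent on the multiplier lam >= 0.
    lam, eta = 0.0, 0.05
    pi = np.zeros(S, dtype=int)
    for _ in range(200):
        # Primal step: best response to the Lagrangian reward r - lam * c.
        pi = q_values(r - lam * c).argmax(axis=1)
        # Dual step: push lam up when the cost constraint is violated, down otherwise.
        violation = policy_value(pi, c) - b
        lam = max(0.0, lam + eta * violation)

    print("multiplier:", round(lam, 3))
    print("return:", round(policy_value(pi, r), 3),
          "cost:", round(policy_value(pi, c), 3), "budget:", round(b, 3))

In the offline setting studied by the paper, the exact primal and dual steps above must instead be estimated from a fixed dataset, which is where the concentrability coefficient $C^*$ and the deviation control mechanism come into play.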
