Markov Decision Processes with Long-Term Average Constraints

Abstract

We consider the problem of constrained Markov Decision Processes (CMDPs), where an agent interacts with a unichain Markov Decision Process. At every interaction, the agent obtains a reward. Further, there are K cost functions. The agent aims to maximize the long-term average reward while simultaneously keeping each of the K long-term average costs below a certain threshold. In this paper, we propose CMDP-PSRL, a posterior sampling based algorithm with which the agent can learn optimal policies for interacting with the CMDP. Further, for an MDP with S states, A actions, and diameter D, we prove that by following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from the optimal policy by \tilde{O}(poly(DSA)\sqrt{T}). We also show that the violation of any of the K constraints is bounded by \tilde{O}(poly(DSA)\sqrt{T}). To the best of our knowledge, this is the first work that obtains \tilde{O}(\sqrt{T}) regret bounds for ergodic MDPs with long-term average constraints.
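The posterior sampling scheme described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the Dirichlet prior, the fixed-length epochs, the assumption that rewards and costs are known, and all variable names are assumptions made for the sketch. At each epoch the agent samples a transition model from its posterior and solves the sampled CMDP via the standard occupancy-measure linear program (maximize average reward subject to flow conservation and the K cost constraints).

```python
# Hypothetical sketch of a CMDP-PSRL-style loop: sample a model from the
# posterior, solve the sampled constrained MDP, act, update counts.
# Assumptions (not from the paper): Dirichlet(1) prior, known rewards/costs,
# fixed-length epochs, scipy's linprog as the LP solver.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, K = 3, 2, 1                       # states, actions, cost functions
tau = np.array([0.6])                   # cost thresholds

def solve_cmdp(P, r, c, tau):
    """Occupancy-measure LP for the sampled CMDP:
    max sum_{s,a} x(s,a) r(s,a)
    s.t. flow conservation, sum x = 1, sum_{s,a} x(s,a) c_k(s,a) <= tau_k."""
    n = S * A
    A_eq = np.zeros((S + 1, n))
    for sp in range(S):                 # flow conservation at each state sp
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] -= P[s, a, sp]
        for a in range(A):
            A_eq[sp, sp * A + a] += 1.0
    A_eq[S] = 1.0                       # occupancy measure sums to one
    b_eq = np.append(np.zeros(S), 1.0)
    res = linprog(-r.reshape(n), A_ub=c.reshape(K, n), b_ub=tau,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    if not res.success:                 # sampled model infeasible: fall back
        return np.full((S, A), 1.0 / A)
    x = res.x.reshape(S, A)
    return x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)

# True (unknown) CMDP, used only to generate interaction data.
P_true = rng.dirichlet(np.ones(S), size=(S, A))
r_true = rng.uniform(size=(S, A))
c_true = rng.uniform(size=(K, S, A))

counts = np.ones((S, A, S))             # Dirichlet(1) posterior counts
s = 0
for epoch in range(20):
    # Sample a transition model from the posterior.
    P_hat = np.array([[rng.dirichlet(counts[si, ai]) for ai in range(A)]
                      for si in range(S)])
    pi = solve_cmdp(P_hat, r_true, c_true, tau)
    for _ in range(50):                 # fixed epoch length (simplification)
        a = rng.choice(A, p=pi[s])
        s_next = rng.choice(S, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next
```

The occupancy-measure LP is the usual way to express the average-reward CMDP exactly; the sampled-model step is what distinguishes posterior sampling from optimism-based approaches.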
