
Efficiently Solving MDPs with Stochastic Mirror Descent

Abstract

We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{tot}$ total state-action pairs and mixing time bound $t_{mix}$, our method computes an $\epsilon$-optimal policy using an expected $\widetilde{O}(t_{mix}^2 A_{tot} \epsilon^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $\gamma$-discounted MDP with $A_{tot}$ total state-action pairs, our method computes an $\epsilon$-optimal policy using an expected $\widetilde{O}((1-\gamma)^{-4} A_{tot} \epsilon^{-2})$ samples, matching the previous state-of-the-art up to a $(1-\gamma)^{-1}$ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains, and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs.
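To make the core technique concrete, below is a minimal, illustrative sketch of primal-dual stochastic mirror descent on a generic bilinear saddle-point problem $\min_{x \in \text{box}} \max_{y \in \Delta} y^\top (Ax - b)$ with a simplex domain for $y$ and a box domain for $x$. It is not the paper's algorithm: the function name, step sizes, and coordinate-sampling scheme are assumptions chosen for clarity, and the paper's method involves MDP-specific samplers and analysis. The simplex player takes entropic (multiplicative-weights) steps; the box player takes Euclidean steps clipped to the box; both use unbiased single-coordinate gradient estimates, mimicking generative-model access.

```python
import numpy as np

def smd_bilinear(A, b, box_radius, iters=5000, eta_x=0.1, eta_y=0.1, seed=0):
    """Hypothetical sketch: stochastic mirror descent for
    min_{x in [-R, R]^n} max_{y in simplex} y^T (A x - b).
    Step sizes and iteration counts are illustrative, not tuned values.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    y = np.full(m, 1.0 / m)             # dual iterate on the simplex
    x = np.zeros(n)                      # primal iterate in the box
    x_avg, y_avg = np.zeros(n), np.zeros(m)

    for _ in range(iters):
        # Simplex player: sample a column j uniformly;
        # n * A[:, j] * x[j] - b is an unbiased estimate of A x - b.
        j = rng.integers(n)
        g_y = n * A[:, j] * x[j] - b
        y = y * np.exp(eta_y * g_y)      # entropic (multiplicative-weights) ascent
        y /= y.sum()                     # renormalize onto the simplex

        # Box player: sample a row i ~ y;
        # A[i, :] is an unbiased estimate of A^T y.
        i = rng.choice(m, p=y)
        g_x = A[i, :]
        x = np.clip(x - eta_x * g_x, -box_radius, box_radius)  # projected descent

        x_avg += x
        y_avg += y

    # Averaged iterates approximate the saddle point.
    return x_avg / iters, y_avg / iters
```

In the MDP applications, the simplex variable plays the role of a state-action occupancy measure and the box variable the role of a value vector, so the averaged dual iterate can be rounded to a policy; the sketch above only conveys the simplex/box mirror-descent mechanics.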
