Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Abstract

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or are computationally inefficient. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*)\, S A T})$, where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space, and $T$ is the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.
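The abstract does not spell out how a bias (span) constraint enters value iteration. As a rough illustration only, the sketch below shows a generic span-constrained value iteration step for a known average-reward MDP: after each Bellman update, the value estimate is recentered and clipped so that its span stays below a given bound. This is not the paper's PMEVI subroutine; the function name, the `span_bound` parameter, and the clipping rule are all assumptions made for illustration.

```python
import numpy as np

def span_constrained_value_iteration(P, R, span_bound, n_iter=200):
    """Toy sketch: value iteration with a span projection step.

    P: transition kernel, shape (S, A, S); R: mean rewards, shape (S, A).
    span_bound: an assumed upper bound on sp(h*); iterates are clipped so
    their span never exceeds it. Illustrative only, not the paper's PMEVI.
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(n_iter):
        # Undiscounted Bellman update: q(s,a) = r(s,a) + sum_s' P(s'|s,a) v(s').
        q = R + np.einsum("sap,p->sa", P, v)
        v_new = q.max(axis=1)
        # Recenter so the minimum is zero (bias functions are defined up to a constant).
        v_new -= v_new.min()
        # Span projection: clip the estimate so that sp(v) <= span_bound.
        v = np.minimum(v_new, span_bound)
    # Greedy policy with respect to the final (span-constrained) value estimate.
    policy = np.argmax(R + np.einsum("sap,p->sa", P, v), axis=1)
    return v, policy
```

In the paper's learning setting, such a projection would be applied to optimistic value iterates computed from confidence sets rather than to the true model, but the clipping idea conveys how a span constraint can be enforced at each iteration.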
