Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Abstract

We develop several provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs). We consider both the online setting and the setting with access to a simulator. In the online setting, we propose model-free RL algorithms based on reference-advantage decomposition. Our algorithm achieves $\widetilde{O}(S^5A^2\,\mathrm{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where $S\times A$ is the size of the state-action space and $\mathrm{sp}(h^*)$ is the span of the optimal bias function. Our results are the first to achieve optimal dependence on $T$ for weakly communicating MDPs. In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}^2(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon}\right)$ samples, whereas the minimax lower bound is $\Omega\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$. Our results are based on two new techniques that are unique to the average-reward setting: 1) better discounted approximation by value-difference estimation; 2) efficient construction of a confidence region for the optimal bias function with space complexity $O(SA)$.
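For intuition, the standard discounted-approximation reduction that the abstract's first technique refines can be sketched as follows. This is not the paper's algorithm: it is a minimal illustration of the generic idea that running (here, model-based) value iteration on the same MDP with a discount factor $\gamma$ close to 1 yields a policy whose average-reward suboptimality scales with $(1-\gamma)\,\mathrm{sp}(h^*)$. The toy MDP, the choice of $\gamma$, and the tolerance are illustrative assumptions.

```python
# Minimal sketch of the discounted-approximation reduction for average-reward MDPs.
# Not the paper's model-free algorithm; a generic tabular illustration only.
import numpy as np

def discounted_value_iteration(P, R, gamma=0.999, tol=1e-8, max_iter=200_000):
    """Value iteration for a tabular discounted MDP.

    P : (S, A, S) transition probabilities
    R : (S, A) expected rewards
    Returns the optimal discounted value function and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)        # (S, A) Bellman backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = (R + gamma * (P @ V)).argmax(axis=1)
    return V, policy

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.95, 0.05]]])
R = np.array([[1.0, 0.0],
              [0.5, 0.8]])

V, policy = discounted_value_iteration(P, R)
# With gamma near 1, (1 - gamma) * V approximates the optimal average reward,
# and the greedy policy is near-optimal for the average-reward criterion.
print("greedy policy:", policy, "approx. average reward:", (1 - 0.999) * V)
```

The paper's contribution, as stated in the abstract, is to make this kind of reduction sample-efficient in the model-free setting via value-difference estimation rather than the plain value iteration shown here.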
