
Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

Abstract

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with $|\mathcal{S}|\times|\mathcal{A}|$, which can be prohibitively large when $\mathcal{S}$ or $\mathcal{A}$ is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp. Q-learning) provably learns an $\varepsilon$-optimal policy (resp. Q-function) with high probability as soon as the sample size exceeds the order of $\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}$ (resp. $\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}$), up to some logarithmic factor. Here $K$ is the feature dimension and $\gamma\in(0,1)$ is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDPs, both the model-based approach and Q-learning are sample-efficient when $K$ is relatively small, and hence the title of this paper.
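To make the feature-based model concrete, the following is a minimal LaTeX sketch of the linear parameterization commonly assumed in this line of work (the symbols $\phi_k$ and $\psi_k$ are illustrative and not defined in the abstract itself), together with the sample-complexity comparison stated above; the tabular rate shown for reference is the known minimax rate that scales with $|\mathcal{S}||\mathcal{A}|$.

% Assumed linear parameterization of the transition kernel:
% \phi_k are known state-action features, \psi_k are unknown functions over next states,
% so estimating P reduces to estimating K components rather than |S||A||S| entries.
\[
  P(s' \mid s, a) \;=\; \sum_{k=1}^{K} \phi_k(s, a)\,\psi_k(s').
\]
% Sample-complexity comparison, up to logarithmic factors:
\[
  \underbrace{\widetilde{O}\!\Big(\tfrac{K}{(1-\gamma)^{3}\varepsilon^{2}}\Big)}_{\text{model-based}}
  \quad\text{vs.}\quad
  \underbrace{\widetilde{O}\!\Big(\tfrac{K}{(1-\gamma)^{4}\varepsilon^{2}}\Big)}_{\text{Q-learning}}
  \quad\text{vs.}\quad
  \underbrace{\widetilde{O}\!\Big(\tfrac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\varepsilon^{2}}\Big)}_{\text{tabular minimax}}.
\]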

