Best Policy Identification in Linear MDPs

Abstract

We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by $\mathcal{O}\big(\frac{d}{(\varepsilon+\Delta)^2}\big(\log\frac{1}{\delta}+d\big)\big)$, where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$) and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.
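To see why a single bound can match both the minimax and gap-dependent regimes (an illustrative reading, not stated explicitly in the abstract): since $\varepsilon+\Delta \ge \max(\varepsilon,\Delta)$, and ignoring constant factors,
$$
\frac{d}{(\varepsilon+\Delta)^2}\Big(\log\tfrac{1}{\delta}+d\Big)
\;\le\;
\min\Big\{\frac{d}{\varepsilon^2},\ \frac{d}{\Delta^2}\Big\}\Big(\log\tfrac{1}{\delta}+d\Big),
$$
so the bound is never worse than the minimax-style rate $\frac{d}{\varepsilon^2}\big(\log\frac{1}{\delta}+d\big)$ nor the gap-dependent rate $\frac{d}{\Delta^2}\big(\log\frac{1}{\delta}+d\big)$.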
