Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Abstract

The practicality of reinforcement learning algorithms has been limited by poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3/\epsilon^2\right)$ over worst-case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown. While one would hope to achieve sample complexity linear in $|S|$ and $|A|$ due to the low-rank structure, we show that, without imposing further assumptions beyond low rank of $Q^*$, if one is constrained to estimate the $Q$ function using only observations from a subset of entries, there is a worst-case instance in which one must incur a sample complexity exponential in the horizon $H$ to learn a near-optimal policy. We subsequently show that, under stronger low-rank structural assumptions and given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\,\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank-$d$ setting, which is minimax optimal with respect to the scaling of $|S|$, $|A|$, and $\epsilon$. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
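
To make the low-rank estimation idea concrete, the following is a minimal sketch, not the paper's LR-MCPI or LR-EVI algorithms: it only illustrates how an exactly rank-$d$ $|S| \times |A|$ matrix can be recovered from noisy observations of $O(|S|+|A|)$ entries (a few full rows and columns) via a CUR-style reconstruction. The dimensions, rank, anchor-set sizes, noise level, and the choice of estimator below are illustrative assumptions, not values or methods taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, rank_d = 500, 100, 3

# Synthetic rank-d ground-truth Q matrix (stand-in for Q* at one time step).
Q_star = rng.normal(size=(num_states, rank_d)) @ rng.normal(size=(rank_d, num_actions))

# Pick a small set of "anchor" states (rows) and actions (columns).
anchor_states = rng.choice(num_states, size=2 * rank_d, replace=False)
anchor_actions = rng.choice(num_actions, size=2 * rank_d, replace=False)

# Pretend a generative model returns noisy Monte Carlo estimates of only these
# rows and columns: O(|S| + |A|) entries instead of all |S| * |A| of them.
noise_std = 0.01
C = Q_star[:, anchor_actions] + noise_std * rng.normal(size=(num_states, 2 * rank_d))
R = Q_star[anchor_states, :] + noise_std * rng.normal(size=(2 * rank_d, num_actions))
W = Q_star[np.ix_(anchor_states, anchor_actions)] + noise_std * rng.normal(size=(2 * rank_d, 2 * rank_d))

# Rank-d truncated pseudoinverse of the intersection block, so near-zero
# singular values coming only from noise are not inverted.
Uw, sw, Vtw = np.linalg.svd(W)
W_pinv = Vtw[:rank_d].T @ np.diag(1.0 / sw[:rank_d]) @ Uw[:, :rank_d].T

# CUR-style reconstruction of the full matrix from the sampled rows/columns.
Q_hat = C @ W_pinv @ R

rel_err = np.linalg.norm(Q_hat - Q_star) / np.linalg.norm(Q_star)
print(f"relative Frobenius error from O(|S|+|A|) noisy entries: {rel_err:.4f}")
```

In the noiseless, exactly rank-$d$ case this reconstruction is exact whenever the anchor intersection block has rank $d$; the sketch only shows why observing a subset of entries can suffice under low-rank structure, whereas the paper's contribution is handling the noise introduced by Monte Carlo rollouts across a long horizon.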
