
Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Abstract

The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3/\epsilon^2\right)$ over worst-case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs that exhibit low-rank structure, where the latent features are unknown. We argue that a natural combination of value iteration and low-rank matrix estimation results in an estimation error that grows doubly exponentially in the horizon $H$. We then provide a new algorithm, along with statistical guarantees, that efficiently exploits low-rank structure given access to a generative model, achieving a sample complexity of $\tilde{O}\left(d^5(|S|+|A|)\,\mathrm{poly}(H)/\epsilon^2\right)$ for a rank-$d$ setting, which is minimax optimal with respect to the scaling of $|S|$, $|A|$, and $\epsilon$. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required of the MDP with respect to the transition kernel versus the optimal action-value function.
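
The abstract contrasts the proposed algorithm with a "natural" combination of value iteration and low-rank matrix estimation. As a rough illustration of what such a naive combination looks like (not the paper's algorithm), here is a minimal Python sketch of backward value iteration in which the estimated Q-matrix over states and actions is truncated to rank $d$ at every horizon step; the generative model `sample_next_state` and reward table `R` are hypothetical placeholders.

```python
# Minimal sketch of the naive value-iteration-plus-low-rank-estimation loop
# the abstract alludes to. `sample_next_state(h, s, a)` and `R` are assumed,
# hypothetical interfaces to a generative model and known rewards.
import numpy as np

def truncated_svd(M, d):
    """Best rank-d approximation of a |S| x |A| matrix via truncated SVD."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :d] * sigma[:d]) @ Vt[:d, :]

def naive_low_rank_value_iteration(S, A, H, d, R, sample_next_state, n_samples=100):
    """Backward induction with per-step rank-d truncation of the Q estimate.

    Errors introduced by the estimation/truncation step can compound across
    the H backups, which is the failure mode the paper analyzes.
    """
    V_next = np.zeros(S)                      # V_{H+1} = 0
    for h in reversed(range(H)):
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                # Monte-Carlo estimate of E[V_{h+1}(s') | s, a] from the generative model
                future = np.mean([V_next[sample_next_state(h, s, a)]
                                  for _ in range(n_samples)])
                Q[s, a] = R[h][s, a] + future
        Q = truncated_svd(Q, d)               # exploit the latent rank-d structure
        V_next = Q.max(axis=1)                # greedy backup to V_h
    return Q
```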
