
Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure

Abstract

The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3/\epsilon^2\right)$ over worst-case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs that exhibit low-rank structure, where the latent features are unknown. We argue that a natural combination of value iteration and low-rank matrix estimation results in an estimation error that grows doubly exponentially in the horizon $H$. We then provide a new algorithm, along with statistical guarantees, that efficiently exploits low-rank structure given access to a generative model, achieving a sample complexity of $\tilde{O}\left(d^5(|S|+|A|)\,\mathrm{poly}(H)/\epsilon^2\right)$ for a rank-$d$ setting, which is minimax optimal with respect to the scaling of $|S|$, $|A|$, and $\epsilon$. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required of the MDP with respect to the transition kernel versus the optimal action-value function.
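
The abstract contrasts the proposed algorithm with a "natural" combination of value iteration and low-rank matrix estimation. As a rough illustration of what such a naive combination looks like (not the paper's algorithm), here is a minimal Python sketch of backward value iteration in which the estimated Q-matrix over states and actions is truncated to rank $d$ at every horizon step; the generative model `sample_next_state` and reward table `R` are hypothetical placeholders.

```python
# Minimal sketch of the naive value-iteration-plus-low-rank-estimation loop
# the abstract alludes to. `sample_next_state(h, s, a)` and `R` are assumed,
# hypothetical interfaces to a generative model and known rewards.
import numpy as np

def truncated_svd(M, d):
    """Best rank-d approximation of a |S| x |A| matrix via truncated SVD."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :d] * sigma[:d]) @ Vt[:d, :]

def naive_low_rank_value_iteration(S, A, H, d, R, sample_next_state, n_samples=100):
    """Backward induction with per-step rank-d truncation of the Q estimate.

    Errors introduced by the estimation/truncation step can compound across
    the H backups, which is the failure mode the paper analyzes.
    """
    V_next = np.zeros(S)                      # V_{H+1} = 0
    for h in reversed(range(H)):
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                # Monte-Carlo estimate of E[V_{h+1}(s') | s, a] from the generative model
                future = np.mean([V_next[sample_next_state(h, s, a)]
                                  for _ in range(n_samples)])
                Q[s, a] = R[h][s, a] + future
        Q = truncated_svd(Q, d)               # exploit the latent rank-d structure
        V_next = Q.max(axis=1)                # greedy backup to V_h
    return Q
```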
