
Model Selection for Generic Reinforcement Learning

Abstract

We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem where the transition kernel $P^*$ belongs to a family of models $\mathcal{P}^*$ with finite metric entropy. In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \ldots \subset \mathcal{P}_M$. We propose and analyze a novel algorithm, namely \emph{Adaptive Reinforcement Learning (General)} (\texttt{ARL-GEN}), that adapts to the smallest such family where the true transition kernel $P^*$ lies. \texttt{ARL-GEN} uses the Upper Confidence Reinforcement Learning (\texttt{UCRL}) algorithm with value-targeted regression as a blackbox and puts a model selection module at the beginning of each epoch. Under a mild separability assumption on the model classes, we show that \texttt{ARL-GEN} obtains a regret of $\Tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2 + \sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$, with high probability, where $H$ is the horizon length, $T$ is the total number of steps, $d_{\mathcal{E}}^*$ is the Eluder dimension and $\mathbb{M}^*$ is the metric entropy corresponding to $\mathcal{P}^*$. Note that this regret scaling matches that of an oracle that knows $\mathcal{P}^*$ in advance. We show that the cost of model selection for \texttt{ARL-GEN} is an additive term in the regret having a weak dependence on $T$. Subsequently, we remove the separability assumption and consider the setup of linear mixture MDPs, where the transition kernel $P^*$ has a linear function approximation. With this low rank structure, we propose novel adaptive algorithms for model selection, and obtain (order-wise) regret identical to that of an oracle with knowledge of the true model class.
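
The abstract describes an epoch-wise wrapper: at the start of each epoch, a model selection module picks a class from the nested family, and the blackbox UCRL learner (with value-targeted regression) is then run restricted to that class. The following is a minimal, hypothetical Python sketch of that loop only; the class list, the goodness-of-fit test, the threshold schedule, and the blackbox learner interface are illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch of an epoch-wise model-selection wrapper over nested
# model classes P_1 ⊂ ... ⊂ P_M (smallest first). All callables are
# placeholders supplied by the user.
def arl_gen_sketch(model_classes, blackbox_ucrl, fit_error, threshold, num_epochs):
    history = []            # transition data collected so far
    regret_per_epoch = []
    for epoch in range(num_epochs):
        # Model-selection module: choose the smallest class whose best fit
        # to the data observed so far passes a (hypothetical) threshold test;
        # fall back to the largest class if none passes.
        chosen = model_classes[-1]
        for cls in model_classes:
            if fit_error(cls, history) <= threshold(epoch, cls):
                chosen = cls
                break
        # Blackbox UCRL-with-value-targeted-regression learner, run for one
        # epoch with the chosen class; returns new transitions and the
        # regret incurred during this epoch.
        new_data, epoch_regret = blackbox_ucrl(chosen, history, epoch)
        history.extend(new_data)
        regret_per_epoch.append(epoch_regret)
    return regret_per_epoch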
