Model-Based Reinforcement Learning with Value-Targeted Regression

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_\theta = \sum_{j=1}^{d} \theta_j P_j$. We propose a model-based RL algorithm based on the optimism principle: in each episode, the set of models that are `consistent' with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting \emph{values} as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret which, in the special case of linear mixtures, takes the form $\tilde{O}(d\sqrt{H^3 T})$, where $H$, $T$ and $d$ are the horizon, the total number of steps and the dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound of $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).
View on arXiv
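
To make the value-targeted regression idea concrete, here is a minimal, hypothetical Python sketch for the linear mixture case, assuming a small tabular MDP with known basis kernels so that feature expectations can be computed exactly. The names (`value_targeted_ridge`, `basis_models`, `beta`, etc.) are illustrative and not the paper's notation; this is a sketch of the general approach, not the paper's implementation.

```python
import numpy as np

def value_targeted_ridge(transitions, value_fn, basis_models, ridge_lambda=1.0):
    """Fit theta so that <theta, x(s,a)> predicts the next value V(s').

    transitions  : list of observed (s, a, s_next) tuples
    value_fn     : array of shape (S,), the last value estimate V
    basis_models : array of shape (d, S, A, S), the known basis kernels P_1..P_d
    """
    d = basis_models.shape[0]
    A = ridge_lambda * np.eye(d)      # regularized Gram matrix
    b = np.zeros(d)
    for (s, a, s_next) in transitions:
        # Feature: expected next value under each basis model at (s, a).
        x = np.array([basis_models[j, s, a] @ value_fn for j in range(d)])
        y = value_fn[s_next]          # regression target: realized next value
        A += np.outer(x, x)
        b += y * x
    theta_hat = np.linalg.solve(A, b)
    return theta_hat, A

def optimistic_backup(value_fn, theta_hat, A, basis_models, rewards, beta):
    """One optimistic Bellman backup: maximize <theta, x(s,a)> over the
    ellipsoid ||theta - theta_hat||_A <= beta, which has the closed form
    <theta_hat, x> + beta * ||x||_{A^{-1}} (an illustrative bonus form)."""
    d, S, num_actions, _ = basis_models.shape
    A_inv = np.linalg.inv(A)
    Q = np.zeros((S, num_actions))
    for s in range(S):
        for a in range(num_actions):
            x = np.array([basis_models[j, s, a] @ value_fn for j in range(d)])
            bonus = beta * np.sqrt(x @ A_inv @ x)
            Q[s, a] = rewards[s, a] + x @ theta_hat + bonus
    return Q.max(axis=1)   # new optimistic value estimate
```

In an episodic loop, one would run the regression on all transitions collected so far, then apply the optimistic backup for each stage of the horizon (from the last stage to the first) to obtain the value function used to act greedily in the next episode.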