Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators

Abstract

In temporal difference (TD) learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation (including multi-step off-policy importance sampling) has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p \in [1,\infty)$, with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios. A number of algorithms (e.g., $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, Retrace$(\lambda)$, and $Q$-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds for these algorithms. In particular, we provide the first known finite-sample guarantees for $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, and Retrace$(\lambda)$, and improve the best known bounds for $Q$-trace in [19]. Moreover, we show the bias-variance trade-offs in each of these algorithms.
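To make the variance issue concrete, the following is a minimal tabular sketch (in Python) of a Retrace$(\lambda)$-style multi-step off-policy evaluation update, in which trace coefficients of the form $\lambda \min(1, \rho_t)$ truncate the product of importance sampling ratios. The function name, array layout, and hyperparameter values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def retrace_style_update(Q, trajectory, pi, mu, gamma=0.99, lam=0.9, alpha=0.1):
    """One Retrace(lambda)-style multi-step off-policy update of a tabular Q-table.

    trajectory: list of (s, a, r, s_next) tuples collected under the behavior policy mu.
    pi, mu: arrays of shape (num_states, num_actions) with action probabilities
            of the target and behavior policies, respectively.
    (Illustrative sketch; names and defaults are assumptions, not the paper's notation.)
    """
    s0, a0 = trajectory[0][0], trajectory[0][1]  # the state-action pair being updated
    trace = 1.0       # running product of truncated importance-sampling ratios
    correction = 0.0  # accumulated discounted, trace-weighted TD errors

    for t, (s, a, r, s_next) in enumerate(trajectory):
        if t > 0:
            # Truncating the ratio at 1 keeps the trace (and hence the variance) bounded.
            rho = pi[s, a] / mu[s, a]
            trace *= lam * min(1.0, rho)
            if trace == 0.0:
                break  # no further TD errors propagate back to (s0, a0)
        # Expected value of the next state under the target policy pi
        v_next = np.dot(pi[s_next], Q[s_next])
        # Multi-step TD error at step t
        delta = r + gamma * v_next - Q[s, a]
        correction += (gamma ** t) * trace * delta

    Q[s0, a0] += alpha * correction
    return Q
```

Choosing how aggressively the ratios are truncated is exactly where the bias-variance trade-off appears: heavier truncation keeps the product of ratios small (low variance) but can move the fixed point of the resulting generalized Bellman operator away from the target value function, while lighter truncation does the opposite.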
