Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Abstract

In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $d\gamma^{2}>1$, where $d$ is the dimension of the feature vector and $\gamma$ is the discount rate. In this regime, for any $q\in[\gamma^{2},1]$, we can construct a hard instance whose feature covariance matrix has smallest eigenvalue $q/d$ and which requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that this sample complexity lower bound is exponential in $d$. If $q=\gamma^{2}$, even an infinite amount of data does not suffice. Under a low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{ \frac{\left\Vert \theta^{\pi}\right\Vert _{2}^{4}}{\varepsilon^{4}}\log\frac{d}{\delta},\ \frac{1}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)\right\} \right)$ samples ($\theta^{\pi}$ is the parameter of the policy's value function in the linear function approximation) and approximates the value function up to an additive error of $\varepsilon$ with probability at least $1-\delta$.
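
As a concrete illustration of policy evaluation with linear function approximation, the sketch below fits a parameter $\hat{\theta}$ from offline transitions with an LSTD-style plug-in estimator, so that $\hat{V}^{\pi}(s)=\phi(s)^{\top}\hat{\theta}$. This is a minimal sketch under assumed inputs (features of sampled states, rewards, and next-state features drawn under the target policy); the estimator, the ridge term, and all variable names are illustrative assumptions and not the algorithm analyzed in the paper.

```python
# Illustrative LSTD-style plug-in estimator for off-policy evaluation with
# linear function approximation (a sketch; not the paper's algorithm).
import numpy as np

def lstd_policy_evaluation(phi, rewards, phi_next, gamma, reg=1e-6):
    """Estimate theta such that V^pi(s) ~= phi(s)^T theta.

    phi      : (n, d) features of sampled states s_i
    rewards  : (n,)   observed rewards r_i
    phi_next : (n, d) features of next states s'_i drawn under the target policy
    gamma    : discount factor in [0, 1)
    reg      : small ridge term added for numerical stability (assumption)
    """
    n, d = phi.shape
    # A = (1/n) sum_i phi_i (phi_i - gamma * phi'_i)^T,  b = (1/n) sum_i phi_i r_i
    A = phi.T @ (phi - gamma * phi_next) / n + reg * np.eye(d)
    b = phi.T @ rewards / n
    return np.linalg.solve(A, b)

# Toy usage with synthetic placeholder data.
rng = np.random.default_rng(0)
n, d, gamma = 1000, 5, 0.9
phi = rng.normal(size=(n, d)) / np.sqrt(d)
phi_next = rng.normal(size=(n, d)) / np.sqrt(d)
rewards = rng.normal(size=n)
theta_hat = lstd_policy_evaluation(phi, rewards, phi_next, gamma)
value_estimate = phi[0] @ theta_hat  # estimated V^pi at the first sampled state
```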
