Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity

Abstract

The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems, where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $\delta \ge 0$. We propose a novel recursion-based algorithm and show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$ trajectories, where $\rho$ is the gap between the optimal $Q$-value of the best action and that of the second-best action, and $\dim_E$ is the Eluder dimension of $\mathcal{F}$. Our result has two implications: 1) in conjunction with the lower bound in [Du et al., ICLR 2020], our upper bound suggests that the condition $\delta = \widetilde{\Theta}\left(\rho/\sqrt{\dim_E}\right)$ is necessary and sufficient for algorithms with polynomial sample complexity; 2) in conjunction with the lower bound in [Wen and Van Roy, NIPS 2013], our upper bound suggests that the sample complexity $\widetilde{\Theta}\left(\dim_E\right)$ is tight even in the agnostic setting. Therefore, we settle the open problem on agnostic $Q$-learning posed in [Wen and Van Roy, NIPS 2013]. We further extend our algorithm to the stochastic reward setting and obtain similar results.
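To make the headline bound concrete, the displays below paraphrase the main result in symbols. The $\ell_\infty$ form of $\delta$, the per-state form of $\rho$, and the absolute constant $c$ are illustrative readings of the abstract, not the paper's exact theorem statement:

$$
\delta := \min_{f \in \mathcal{F}} \max_{s,a} \left| f(s,a) - Q^*(s,a) \right|,
\qquad
\rho := \min_{s} \Big( Q^*\big(s, \pi^*(s)\big) - \max_{a \ne \pi^*(s)} Q^*(s,a) \Big),
$$

$$
\delta \le c \cdot \frac{\rho}{\sqrt{\dim_E}}
\quad \Longrightarrow \quad
\text{an optimal policy is found within } O\!\left(\dim_E\right) \text{ trajectories.}
$$

Read together with the two implications above, this says the threshold on $\delta$ cannot be loosened beyond logarithmic factors without losing polynomial sample complexity (by the lower bound of [Du et al., ICLR 2020]), and the trajectory count matches the lower bound of [Wen and Van Roy, NIPS 2013] up to logarithmic factors.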
