We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDP). For tabular MDP with $S$ states and $A$ actions, or linear MDP with anchor points and feature dimension $d$, given $K$ collected episodes of data with minimum visiting probability $d_m$ over the (anchor) state-action pairs, we obtain nearly horizon-$H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by $1$. Specifically: 1. For offline policy evaluation, we obtain an $\tilde{O}\big(\sqrt{1/(K d_m)}\big)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and carries no additional polynomial dependency on $H$, $S$, $A$, or $d$ in the higher-order term. 2. For offline policy optimization, we obtain an $\tilde{O}\big(\sqrt{1/(K d_m)}\big)$ sub-optimality gap, up to a higher-order term, for the empirical optimal policy; this approaches the lower bound up to logarithmic factors and the higher-order term, improving upon the best known result of \cite{cui2020plug}, whose main term carries additional polynomial factors of $H$. To the best of our knowledge, these are the \emph{first} set of nearly horizon-free bounds for episodic time-homogeneous offline tabular MDP and linear MDP with anchor points. Central to our analysis is a simple yet effective recursion-based method for bounding a "total variance" term in the offline setting, which could be of independent interest.
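To make the object of study concrete, below is a minimal sketch of a plug-in (model-based) estimator for tabular offline policy evaluation: fit the maximum-likelihood transition model from the pooled offline transitions, then evaluate the target policy by backward induction on that estimated model. The function name `plug_in_ope`, the known-deterministic-reward assumption, and the uniform fallback for unvisited pairs are illustrative choices, not details taken from the paper.

```python
# A minimal sketch (not the paper's implementation) of a plug-in estimator for
# offline policy evaluation in a tabular, episodic, time-homogeneous MDP.
# Assumptions beyond the abstract: known rewards r[s, a], a fixed target policy
# pi(a | s), and offline data given as pooled (s, a, s') transitions.
import numpy as np

def plug_in_ope(transitions, rewards, policy, S, A, H):
    """Estimate the H-step value of `policy` from offline transition data.

    transitions : iterable of (s, a, s_next) tuples pooled over the K episodes
    rewards     : array of shape (S, A), known per-step rewards
    policy      : array of shape (S, A), target policy pi(a | s)
    S, A, H     : number of states, number of actions, and the horizon
    Returns the estimated value function at the first step, shape (S,).
    """
    # Empirical (maximum-likelihood) transition model P_hat(s' | s, a).
    counts = np.zeros((S, A, S))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=-1, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform model; any convention works
    # here, since the minimum visiting probability d_m > 0 makes this case
    # vanish as K grows.
    P_hat = np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / S)

    # Backward induction on the estimated model (the "plug-in" step).
    V_hat = np.zeros(S)
    for _ in range(H):
        Q_hat = rewards + P_hat @ V_hat          # shape (S, A)
        V_hat = (policy * Q_hat).sum(axis=-1)    # average over pi(a | s)
    return V_hat
```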