
Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments

Abstract

We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega((B_{\star} S A T_{\star}(\Delta_c + B_{\star}^2\Delta_P))^{1/3}K^{2/3})$, where $B_{\star}$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_{\star}$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and $\Delta_P$ are the amounts of change of the cost and transition functions respectively, and $K$ is the number of episodes. The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal algorithm and another more involved minimax-optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], and adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star} S\sqrt{ALK},\ (B_{\star}^2 S^2 A T_{\star}(\Delta_c + B_{\star}\Delta_P))^{1/3}K^{2/3}\})$ regret, where $L$ is the unknown number of changes of the environment.
