Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

25 May 2022
Jiawei Huang
Li Zhao
Tao Qin
Wei Chen
Nan Jiang
Tie-Yan Liu
    OffRL
Abstract

We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where users can be divided into two groups based on their different tolerance of exploration risk and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with the more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for the risk-averse users from the second tier, utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}} = \pi^{\text{O}}$) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if we choose Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve constant regret for the risk-averse users, independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret of any online RL algorithm in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online optimality and does not need to be compromised for the success of $\pi^{\text{E}}$.
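To make the tiered setup concrete, below is a minimal sketch of the interaction structure the abstract describes, for a tabular episodic MDP. All names here (the TieredLearner class, the bonus constant, the Hoeffding-style bonus form) are illustrative assumptions, not the paper's code or its exact algorithm: $\pi^{\text{O}}$ is obtained from an optimistic value iteration (UCB-VI style) to balance exploration and exploitation, while $\pi^{\text{E}}$ is recomputed from the same counts with Pessimistic Value Iteration, i.e., the bonus is subtracted instead of added.

```python
# Sketch only: a tabular tiered learner that serves an optimistic policy pi_O
# to tier-1 (risk-tolerant) users and a pessimistic policy pi_E to tier-2
# (risk-averse) users, both planned from the data collected by pi_O.
import numpy as np

class TieredLearner:
    def __init__(self, S, A, H, c_bonus=1.0):
        self.S, self.A, self.H = S, A, H
        self.c = c_bonus                      # bonus scale (assumed constant)
        self.N = np.zeros((S, A))             # state-action visit counts
        self.N_sas = np.zeros((S, A, S))      # transition counts
        self.R_sum = np.zeros((S, A))         # cumulative observed rewards

    def update(self, trajectory):
        """Record one episode collected by pi_O: a list of (s, a, r, s_next)."""
        for s, a, r, s_next in trajectory:
            self.N[s, a] += 1
            self.N_sas[s, a, s_next] += 1
            self.R_sum[s, a] += r

    def _value_iteration(self, sign):
        """sign=+1: optimistic planning (pi_O); sign=-1: pessimistic planning (pi_E)."""
        n = np.maximum(self.N, 1)
        P_hat = self.N_sas / n[:, :, None]    # empirical transition model
        R_hat = self.R_sum / n                # empirical mean rewards
        bonus = self.c * np.sqrt(np.log(2 * self.S * self.A * self.H) / n)
        V = np.zeros(self.S)
        policy = np.zeros((self.H, self.S), dtype=int)
        for h in reversed(range(self.H)):
            Q = R_hat + sign * bonus + P_hat @ V   # add or subtract the bonus
            Q = np.clip(Q, 0.0, self.H)            # values stay in [0, H]
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy

    def policy_online(self):
        return self._value_iteration(sign=+1)   # pi_O: explore via optimism

    def policy_exploit(self):
        return self._value_iteration(sign=-1)   # pi_E: Pessimistic Value Iteration
```

The only difference between the two planners in this sketch is the sign of the confidence bonus: optimism drives exploration for the first tier, while pessimism keeps $\pi^{\text{E}}$ inside the well-covered part of the collected data, which is the mechanism the abstract points to for the constant-regret guarantee in the gap-dependent case.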
