Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

25 November 2020
Jingfeng Wu
Vladimir Braverman
Lin F. Yang
Abstract

In this paper we consider multi-objective reinforcement learning, where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner; for example, customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives an (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}\cdot H^2 SAK}\bigr)$, where $d$ is the number of objectives, $S$ is the number of states, $A$ is the number of actions, $H$ is the length of the horizon, and $K$ is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and is then able to accommodate an arbitrary preference vector up to $\epsilon$ error. Our proposed algorithm is provably efficient, with a nearly optimal trajectory complexity $\widetilde{\mathcal{O}}\bigl(\min\{d,S\}\cdot H^3 SA/\epsilon^2\bigr)$. This result partly resolves an open problem raised by \citet{jin2020reward}.
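To make the online setting concrete, the following is a minimal illustrative sketch (not the authors' algorithm): each episode an adversary supplies a preference vector, and the scalar reward is the inner product of that preference with a pre-specified $d$-dimensional reward function. All names, shapes, and the random-policy rollout below are hypothetical and for illustration only.

```python
# Illustrative sketch of the preference-scalarized episodic MDP from the abstract.
# Dimensions, the reward table, and the transition kernel are made up for the example.
import numpy as np

S, A, H, d = 5, 3, 10, 2          # states, actions, horizon, number of objectives
rng = np.random.default_rng(0)

# Pre-specified multi-objective reward: multi_reward[s, a] is a vector in [0, 1]^d.
multi_reward = rng.uniform(size=(S, A, d))
# Unknown (to the agent) transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))

def scalar_reward(s, a, w):
    """Scalarized reward <w, r(s, a)> for a preference vector w."""
    return float(multi_reward[s, a] @ w)

def run_episode(policy, w):
    """Roll out one episode of length H under a (possibly adversarial) preference w."""
    s, total = 0, 0.0
    for h in range(H):
        a = policy(s, h)
        total += scalar_reward(s, a, w)
        s = rng.choice(S, p=P[s, a])
    return total

# Example: a placeholder uniform-random policy facing one adversarial preference.
random_policy = lambda s, h: rng.integers(A)
w_k = np.array([0.9, 0.1])        # example preference over the d objectives
print(run_episode(random_policy, w_k))
```

In the paper's regret bound, the learner's policies would be produced by a model-based algorithm rather than the random placeholder above; the sketch only fixes the interaction protocol and the inner-product reward structure.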
