In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, motivated by certain operational concerns in online platforms, we consider a new objective in the classical setup. Given arms, instead of maximizing the expected total reward from pulls (the traditional "sum" objective), we consider the vector of total rewards earned from each of the arms at the end of pulls and aim to maximize the expected highest total reward across arms (the "max" objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of (with a higher instance-dependent constant compared to the traditional objective) and a worst-case regret of . We then design an adaptive explore-then-commit policy featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). We then generalize our algorithmic insights to the problem of maximizing the expected value of the average total reward of the top arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared to several natural alternatives in practical parameter regimes. We discuss applications of these new objectives to the problem of grooming an adequate supply of value-providing market participants (workers/sellers/service providers) in online platforms.
View on arXiv