Finite Continuum-Armed Bandits

Abstract

We consider a situation where an agent has $T$ resources to allocate among a larger number $N$ of actions. Each action can be completed at most once and yields a stochastic reward with unknown mean. The goal of the agent is to maximize her cumulative reward. Nontrivial strategies are possible when side information on the actions is available, for example in the form of covariates. Focusing on a nonparametric setting, where the mean reward is an unknown function of a one-dimensional covariate, we propose an optimal strategy for this problem. Under natural assumptions on the reward function, we prove that the optimal regret scales as $O(T^{1/3})$, up to poly-logarithmic factors, when the budget $T$ is proportional to the number of actions $N$. When $T$ becomes small compared to $N$, a smooth transition occurs: as the ratio $T/N$ decreases from a constant to $N^{-1/3}$, the regret increases progressively up to the $O(T^{1/2})$ rate encountered in continuum-armed bandits.
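
To make the setting concrete, the following is a minimal Python sketch of the problem described above, not the strategy proposed in the paper. It assumes covariates drawn uniformly on [0, 1] and measures regret in expectation against an oracle that completes the `budget` actions with the highest mean rewards; both choices are assumptions made for illustration. All names (`make_instance`, `oracle_value`, `expected_regret`, `uniform_random_policy`) and the example reward function are hypothetical.

```python
import random


def make_instance(n_actions, mean_reward, seed=0):
    """Draw one covariate per action and compute its mean reward.

    Assumption: covariates are uniform on [0, 1]. `mean_reward` is a
    hypothetical function of the one-dimensional covariate; in the real
    problem the agent does not know it.
    """
    rng = random.Random(seed)
    covariates = [rng.random() for _ in range(n_actions)]
    means = [mean_reward(x) for x in covariates]
    return covariates, means


def oracle_value(means, budget):
    """Expected cumulative reward of completing the `budget` best actions."""
    return sum(sorted(means, reverse=True)[:budget])


def expected_regret(means, chosen, budget):
    """Oracle value minus the expected reward of the chosen actions.

    Each action can be completed at most once, so `chosen` must contain
    distinct indices and no more than `budget` of them.
    """
    if len(chosen) > budget or len(set(chosen)) != len(chosen):
        raise ValueError("chosen must hold at most `budget` distinct actions")
    return oracle_value(means, budget) - sum(means[i] for i in chosen)


def uniform_random_policy(n_actions, budget, seed=0):
    """Baseline that ignores covariates: pick `budget` distinct actions."""
    rng = random.Random(seed)
    return rng.sample(range(n_actions), budget)


if __name__ == "__main__":
    n_actions, budget = 1000, 200  # N actions, T resources, with T < N
    # Hypothetical reward function peaking at covariate 0.5.
    _, means = make_instance(n_actions, lambda x: 1.0 - abs(x - 0.5))
    chosen = uniform_random_policy(n_actions, budget)
    print("expected regret of random baseline:",
          expected_regret(means, chosen, budget))
```

The random baseline ignores the covariates entirely, so its expected regret grows linearly in the budget. Per the abstract, a strategy that exploits the covariates can bring the regret down to the order of $T^{1/3}$, up to poly-logarithmic factors, when $T$ is proportional to $N$.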
