Linearly Parameterized Bandits
Abstract
We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an r-dimensional random vector Z ∈ ℝ^r, where r ≥ 2. The objective is to choose a sequence of arms so as to minimize the cumulative regret and Bayes risk. We propose a policy based on least squares estimation and uncertainty ellipsoids, which generalizes the upper confidence index approach pioneered by Lai and Robbins (1985). The cumulative regret and Bayes risk under our proposed policy admit an upper bound of the form O(r√T log^{3/2} T), which is linear in the dimension r and independent of the number of arms. We also establish lower bounds on the regret and risk, showing that our proposed policy is nearly optimal.
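The policy the abstract describes combines a least-squares estimate of the unknown parameter with an uncertainty ellipsoid that inflates each arm's estimated reward into an upper confidence index. The following is a minimal sketch of that idea, not the paper's exact construction: the arm set (random points on the unit sphere), the noise level, the ridge parameter `lam`, and the ellipsoid radius `beta` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

r, T = 3, 2000                # dimension and horizon (illustrative)
lam, beta = 1.0, 2.0          # assumed ridge parameter and ellipsoid radius
Z = rng.normal(size=r)        # unknown parameter vector, hidden from the policy
arms = rng.normal(size=(50, r))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)  # arms on the unit sphere

V = lam * np.eye(r)           # regularized design matrix
b = np.zeros(r)               # accumulated reward-weighted features
regret = 0.0
best = (arms @ Z).max()       # reward of the best arm, for bookkeeping only

for t in range(T):
    theta = np.linalg.solve(V, b)   # least-squares estimate of Z
    Vinv = np.linalg.inv(V)
    # Upper confidence index: estimated reward plus ellipsoid width
    # beta * sqrt(x' V^{-1} x) for each arm x.
    widths = np.sqrt(np.einsum("ij,jk,ik->i", arms, Vinv, arms))
    x = arms[np.argmax(arms @ theta + beta * widths)]
    reward = x @ Z + rng.normal(scale=0.1)   # noisy linear reward
    V += np.outer(x, x)
    b += reward * x
    regret += best - x @ Z
```

The index shrinks toward the plain least-squares prediction as the design matrix V grows along well-explored directions, which is what drives the regret bound's √T dependence.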
