
Linearly Parameterized Bandits

Abstract

We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an $r$-dimensional random vector $\mathbf{Z} \in \mathbb{R}^r$, where $r \geq 2$. The objective is to choose a sequence of arms to minimize the cumulative regret and Bayes risk. We propose a policy based on least squares estimation and uncertainty ellipsoids, which generalizes the upper confidence index approach pioneered by Lai and Robbins (1985). The cumulative regret and Bayes risk under our proposed policy admit an upper bound of the form $r \sqrt{T} \log^{3/2} T$, which is linear in the dimension $r$ and independent of the number of arms. We also establish $\Omega(r \sqrt{T})$ lower bounds on the regret and risk, showing that our proposed policy is nearly optimal.
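
To illustrate the kind of policy the abstract describes, the sketch below simulates a confidence-ellipsoid (upper confidence index) rule for a linearly parameterized bandit with a finite arm set: a regularized least-squares estimate of the unknown vector $\mathbf{Z}$ is maintained, and each arm is scored by its estimated mean plus an ellipsoid-width bonus. The regularization constant, the ellipsoid radius schedule `beta_t`, the noise level, and all variable names are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch (assumed parameters, not the paper's exact policy):
# least-squares estimation of Z plus an uncertainty-ellipsoid bonus per arm.
import numpy as np

rng = np.random.default_rng(0)

r = 3                                   # dimension of the unknown vector Z
T = 2000                                # horizon
lam = 1.0                               # ridge regularization (assumed)
z_true = rng.normal(size=r)             # unknown parameter, used only to simulate rewards
arms = rng.normal(size=(50, r))         # finite set of arm feature vectors
arms /= np.linalg.norm(arms, axis=1, keepdims=True)

V = lam * np.eye(r)                     # design matrix: lam*I + sum_s u_s u_s^T
b = np.zeros(r)                         # sum_s x_s u_s (rewards times chosen features)
best_mean = arms @ z_true
cumulative_regret = 0.0

for t in range(1, T + 1):
    z_hat = np.linalg.solve(V, b)                       # least-squares estimate of Z
    beta_t = 1.0 + np.sqrt(r * np.log(t + 1))           # assumed ellipsoid radius schedule
    V_inv = np.linalg.inv(V)
    # Width of the uncertainty ellipsoid in each arm's direction: sqrt(u^T V^{-1} u).
    widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
    ucb = arms @ z_hat + beta_t * widths                # upper confidence index per arm
    k = int(np.argmax(ucb))
    u = arms[k]
    x = u @ z_true + rng.normal(scale=0.1)              # noisy linear reward
    V += np.outer(u, u)                                 # update design matrix
    b += x * u                                          # update response vector
    cumulative_regret += best_mean.max() - best_mean[k]

print(f"cumulative regret after T={T}: {cumulative_regret:.2f}")
```

Under this kind of rule, arms whose direction has been explored little receive a large bonus, which is what drives the dependence of the regret bound on the dimension $r$ rather than on the number of arms.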
