Contextual Multi-armed Bandits under Feature Uncertainty

Abstract

We study contextual multi-armed bandit problems under linear realizability on rewards and uncertainty (or noise) on features. For the case of identical noise on features across actions, we propose an algorithm, coined {\em NLinRel}, having an $O\left(T^{\frac{7}{8}}\left(\log(dT)+K\sqrt{d}\right)\right)$ regret bound for $T$ rounds, $K$ actions, and $d$-dimensional feature vectors. Next, for the case of non-identical noise, we observe that it is impossible for popular linear hypotheses, including {\em NLinRel}, to achieve such a sub-linear regret. Instead, under the assumption of Gaussian feature vectors, we prove that a greedy algorithm achieves an $O\left(T^{\frac{2}{3}}\sqrt{\log d}\right)$ regret bound with respect to the optimal linear hypothesis. Utilizing our theoretical understanding of the Gaussian case, we also design a practical variant of {\em NLinRel}, coined {\em Universal-NLinRel}, for arbitrary feature distributions. It first runs {\em NLinRel} to find the `true' coefficient vector using the feature uncertainties, and then adjusts it to minimize its regret using statistical feature information. We validate the performance of {\em Universal-NLinRel} on both synthetic and real-world datasets.
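To make the problem setting concrete, here is a minimal Python sketch of a contextual bandit loop with linear rewards and identical feature noise across actions, paired with a greedy ridge-regression learner. This is not the authors' {\em NLinRel} or {\em Universal-NLinRel}; all constants, names, and the ridge baseline are illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: linear rewards with noisy feature observations,
# plus a greedy ridge-regression learner as a simple baseline.
rng = np.random.default_rng(0)
T, K, d = 2000, 5, 10            # rounds, actions, feature dimension (illustrative)
theta = rng.normal(size=d)       # unknown true coefficient vector
theta /= np.linalg.norm(theta)
noise_std = 0.1                  # feature-noise level, identical across actions

A = np.eye(d)                    # ridge-regression statistics
b = np.zeros(d)
regret = 0.0

for t in range(T):
    X_true = rng.normal(size=(K, d))                       # true features (unobserved)
    X_obs = X_true + noise_std * rng.normal(size=(K, d))   # noisy observations

    theta_hat = np.linalg.solve(A, b)                      # current estimate of theta
    arm = int(np.argmax(X_obs @ theta_hat))                # greedy choice on noisy features

    reward = X_true[arm] @ theta + 0.05 * rng.normal()     # reward is linear in true features
    regret += np.max(X_true @ theta) - X_true[arm] @ theta

    A += np.outer(X_obs[arm], X_obs[arm])                  # update statistics with what was seen
    b += reward * X_obs[arm]

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Note how the learner only ever sees `X_obs`, so its regression statistics are biased by the feature noise; handling this bias, rather than the reward noise alone, is the difficulty the paper's algorithms address.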
