Contextual Semibandits via Supervised Learning Oracles

Abstract

We study an online decision making problem where on each round a learner chooses a list of items based on some side information, receives a scalar feedback value for each individual item, and a reward that is linearly related to this feedback. These problems, known as contextual semibandits, arise in crowd-sourcing, recommendation, and many other domains. This paper reduces contextual semibandits to supervised learning, so that we can leverage powerful supervised learning methods in this partial-feedback setting. Our first reduction, which applies when the mapping from feedback to reward is known, leads to a computationally efficient algorithm with a near-optimal regret guarantee. We show that this algorithm outperforms state-of-the-art approaches on real-world learning-to-rank datasets, demonstrating the advantage of oracle-based algorithms. We also develop and analyze a novel algorithm for the setting where the linear transformation is unknown.
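The interaction protocol described above can be sketched as a toy simulation. All names below are hypothetical illustrations, not the paper's implementation: on each round the learner observes a context, selects a list of K items, sees one noisy scalar of feedback per chosen item (semibandit feedback), and receives a reward given by a known linear transformation of that feedback vector (here, a plain sum).

```python
import random

random.seed(0)

NUM_ITEMS = 10   # size of the item universe (assumption)
LIST_LEN = 3     # K: length of the composite action (assumption)

def draw_context():
    """A toy context: one preference score per item."""
    return [random.random() for _ in range(NUM_ITEMS)]

def choose_list(context, k=LIST_LEN):
    """A naive policy: pick the k items with the highest context score.
    A real learner would instead call a supervised-learning oracle."""
    return sorted(range(NUM_ITEMS), key=lambda i: -context[i])[:k]

def per_item_feedback(context, items):
    """Semibandit feedback: a noisy scalar for each chosen item only;
    nothing is observed for items not in the list."""
    return [context[i] + random.gauss(0, 0.1) for i in items]

def reward(feedback):
    """Known linear transformation of the feedback (here, the sum)."""
    return sum(feedback)

# One round of the protocol.
context = draw_context()
items = choose_list(context)
fb = per_item_feedback(context, items)
r = reward(fb)
```

When the linear transformation is unknown, as in the paper's second setting, `reward` would have to be estimated from (feedback, reward) observations rather than assumed.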
