Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets

Abstract

We present a family $\{\hat{\pi}\}_{p\ge 1}$ of pessimistic learning rules for offline learning of linear contextual bandits, relying on confidence sets with respect to different $\ell_p$ norms, where $\hat{\pi}_2$ corresponds to Bellman-consistent pessimism (BCP), while $\hat{\pi}_\infty$ is a novel generalization of lower confidence bound (LCB) to the linear setting. We show that the novel $\hat{\pi}_\infty$ learning rule is, in a sense, adaptively optimal, as it achieves the minimax performance (up to log factors) against all $\ell_q$-constrained problems, and as such it strictly dominates all other predictors in the family, including $\hat{\pi}_2$.
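
To make the general construction concrete, below is a minimal, self-contained sketch of pessimistic action selection over an $\ell_p$ confidence set for the linear reward parameter. Everything in it is an illustrative assumption rather than the paper's exact method: the ridge estimator, the particular set $\{\theta : \|\Sigma^{1/2}(\theta-\hat\theta)\|_p \le \beta\}$, the choice of $\beta$, and the helper names (fit_ridge, pessimistic_value, pessimistic_policy) are all hypothetical. Under that assumed set, Hölder duality gives a penalty in the dual norm, recovering the familiar elliptical bonus at $p=2$ (BCP-style) and an $\ell_1$-type penalty at $p=\infty$ (LCB-style).

```python
# Illustrative sketch only: a generic "pessimism via an ell_p confidence set"
# rule for offline linear contextual bandits. The confidence-set construction,
# the bonus scaling beta, and the synthetic data below are assumptions for
# demonstration, not the paper's definitions.

import numpy as np

def fit_ridge(Phi, r, lam=1.0):
    """Ridge estimate of the reward parameter from offline data.
    Phi: (n, d) feature matrix of logged (context, action) pairs.
    r:   (n,)  observed rewards."""
    d = Phi.shape[1]
    Sigma = Phi.T @ Phi + lam * np.eye(d)          # regularized covariance
    theta_hat = np.linalg.solve(Sigma, Phi.T @ r)
    return theta_hat, Sigma

def pessimistic_value(phi, theta_hat, Sigma, beta, p):
    """Lower confidence bound on phi @ theta over the (assumed) set
    { theta : || Sigma^{1/2} (theta - theta_hat) ||_p <= beta }.
    By Holder duality this equals
    phi @ theta_hat - beta * || Sigma^{-1/2} phi ||_q  with 1/p + 1/q = 1."""
    # Sigma^{-1/2} phi via eigendecomposition (Sigma is symmetric PD here).
    w, V = np.linalg.eigh(Sigma)
    z = V @ ((V.T @ phi) / np.sqrt(w))
    q = np.inf if p == 1 else (1.0 if np.isinf(p) else p / (p - 1.0))
    dual_norm = np.max(np.abs(z)) if np.isinf(q) else np.sum(np.abs(z) ** q) ** (1.0 / q)
    return phi @ theta_hat - beta * dual_norm

def pessimistic_policy(context_features, theta_hat, Sigma, beta, p):
    """For one context, pick the action maximizing the pessimistic value.
    context_features: (num_actions, d), one feature row per action."""
    vals = [pessimistic_value(phi, theta_hat, Sigma, beta, p)
            for phi in context_features]
    return int(np.argmax(vals))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, A = 500, 5, 4
    theta_star = rng.normal(size=d)
    Phi = rng.normal(size=(n, d))                     # logged features
    r = Phi @ theta_star + 0.1 * rng.normal(size=n)   # noisy rewards
    theta_hat, Sigma = fit_ridge(Phi, r)

    test_features = rng.normal(size=(A, d))           # one new context
    for p in (2, np.inf):                             # p=2 vs p=infinity rules
        a = pessimistic_policy(test_features, theta_hat, Sigma, beta=1.0, p=p)
        print(f"p={p}: pessimistic action = {a}")
```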
