Improved Optimistic Algorithm For The Multinomial Logit Contextual Bandit

European Journal of Operational Research (EJOR), 2020
Abstract

We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments of cardinality at most $K$, out of $N$ items, to minimize the expected cumulative regret (loss of revenue). The feedback is given by a multinomial logit (MNL) choice model. This sequential decision-making problem is studied under the MNL contextual bandit framework. The existing algorithms for the MNL contextual bandit have frequentist regret guarantees of $\tilde{\mathrm{O}}(\kappa\sqrt{T})$, where $\kappa$ is an instance-dependent constant. $\kappa$ could be arbitrarily large, e.g. exponentially dependent on the model parameters, causing the existing regret guarantees to be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys $\tilde{\mathrm{O}}(\sqrt{T})$ regret. In our bounds, the $\kappa$ factor only affects the poly-log term and not the leading term of the regret bounds.
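
To make the setting concrete, the following is a minimal sketch of the standard MNL choice model with a no-purchase option and the per-round regret of an offered assortment. It is not the paper's algorithm: it uses no exploration bonus, and the linear utility form, the function names, and the brute-force search over assortments are illustrative assumptions.

```python
import itertools
import math


def mnl_choice_probs(utilities):
    """Return (p_no_purchase, [p_i]) for an offered assortment under MNL.

    Assumes the usual normalization in which the outside (no-purchase)
    option has utility 0, i.e. weight exp(0) = 1.
    """
    weights = [math.exp(u) for u in utilities]
    denom = 1.0 + sum(weights)
    return 1.0 / denom, [w / denom for w in weights]


def expected_revenue(assortment, features, revenues, theta):
    """Expected revenue of an assortment, assuming utility_i = <x_i, theta>."""
    utilities = [
        sum(f * t for f, t in zip(features[i], theta)) for i in assortment
    ]
    _, probs = mnl_choice_probs(utilities)
    return sum(revenues[i] * p for i, p in zip(assortment, probs))


def best_assortment(features, revenues, theta, K):
    """Brute-force the revenue-maximizing assortment of size at most K.

    Exponential in K; intended only for tiny illustrative instances.
    """
    n = len(features)
    best, best_rev = (), 0.0
    for size in range(1, K + 1):
        for candidate in itertools.combinations(range(n), size):
            rev = expected_revenue(candidate, features, revenues, theta)
            if rev > best_rev:
                best, best_rev = candidate, rev
    return best, best_rev


def per_round_regret(offered, features, revenues, theta, K):
    """Revenue lost in one round by offering `offered` instead of the optimum."""
    _, opt_rev = best_assortment(features, revenues, theta, K)
    return opt_rev - expected_revenue(offered, features, revenues, theta)
```

Summing `per_round_regret` over $T$ rounds gives the cumulative regret that the bounds above refer to. In the bandit setting `theta` is unknown and must be estimated from observed choices.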
