Improved Optimistic Algorithm For The Multinomial Logit Contextual
Bandit
We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments of cardinality at most $K$, out of $N$ items, to minimize the expected cumulative regret (loss of revenue). The feedback is given by a multinomial logit (MNL) choice model. This sequential decision-making problem is studied under the MNL contextual bandit framework. The existing algorithms for the MNL contextual bandit have frequentist regret guarantees of $\tilde{O}(\kappa\sqrt{T})$, where $\kappa$ is an instance-dependent constant. $\kappa$ could be arbitrarily large, e.g., exponentially dependent on the model parameters, causing the existing regret guarantees to be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys $\tilde{O}(\sqrt{T})$ regret. In our bounds, the $\kappa$ factor affects only the poly-log term, not the leading term of the regret bound.
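To make the feedback model concrete, here is a minimal sketch of the standard MNL choice probabilities, assuming linear utilities $x_i^\top\theta$ with item feature vectors $x_i$ and an unknown parameter $\theta$; the function name and setup are illustrative, not the paper's code.

```python
import numpy as np

def mnl_choice_probs(X, theta):
    """Standard MNL choice probabilities with an outside (no-purchase) option.

    X: (|S|, d) feature matrix of the offered assortment S; theta: (d,) parameter.
    P(choose item i) = exp(x_i . theta) / (1 + sum_{j in S} exp(x_j . theta)).
    """
    utilities = np.exp(X @ theta)
    denom = 1.0 + utilities.sum()   # the "1" accounts for the no-purchase option
    return utilities / denom, 1.0 / denom

# Illustrative example: an assortment of two items in a 2-dimensional feature space.
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
theta = np.array([0.5, -0.5])
p_items, p_outside = mnl_choice_probs(X, theta)
# Probabilities over {item 1, item 2, no purchase} form a distribution.
assert np.isclose(p_items.sum() + p_outside, 1.0)
```

In the bandit setting, the learner observes a single choice drawn from this distribution each round, which is what makes estimating $\theta$ (and hence controlling $\kappa$) the central difficulty.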