Bandit Online Optimization Over the Permutahedron

5 December 2013

Abstract

The permutahedron is the convex polytope with vertex set consisting of the vectors $(\pi(1),\dots, \pi(n))$ for all permutations (bijections) $\pi$ over $\{1,\dots, n\}$ . We study a bandit game in which, at each step $t$ , an adversary chooses a hidden weight weight vector $s_t$ , a player chooses a vertex $\pi_t$ of the permutahedron and suffers an observed loss of $\sum_{i=1}^n \pi(i) s_t(i)$ . A previous algorithm CombBand of Cesa-Bianchi et al (2009) guarantees a regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$ . Unfortunately, CombBand requires at each step an $n$ -by- $n$ matrix permanent approximation to within improved accuracy as $T$ grows, resulting in a total running time that is super linear in $T$ , making it impractical for large time horizons. We provide an algorithm of regret $O(n^{3/2}\sqrt{T})$ with total time complexity $O(n^3T)$ . The ideas are a combination of CombBand and a recent algorithm by Ailon (2013) for online optimization over the permutahedron in the full information setting. The technical core is a bound on the variance of the Plackett-Luce noisy sorting process's "pseudo loss". The bound is obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices generated from rational functions of exponentials of 3 parameters.

View on arXiv

Comments on this paper