Bandit Online Optimization Over the Permutahedron

The permutahedron is the convex polytope whose vertex set consists of the vectors $(\pi(1), \dots, \pi(n))$ for all permutations (bijections) $\pi$ over $\{1, \dots, n\}$. We study a bandit game in which, at each step $t$, an adversary chooses a hidden weight vector $s_t$, and a player chooses a vertex $\pi_t$ of the permutahedron and suffers an observed loss of $\sum_{i=1}^{n} \pi_t(i) s_t(i)$. A previous algorithm, CombBand of Cesa-Bianchi et al. (2009), guarantees a regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$. Unfortunately, CombBand requires at each step an $n$-by-$n$ matrix permanent approximation whose accuracy must improve as $T$ grows, resulting in a total running time that is superlinear in $T$ and making it impractical for large time horizons. We provide an algorithm with regret $O(n^{3/2}\sqrt{T})$ and total time complexity $O(n^3 T)$. The ideas combine CombBand with a recent algorithm by Ailon (2013) for online optimization over the permutahedron in the full-information setting. The technical core is a bound on the variance of the "pseudo loss" of the Plackett-Luce noisy sorting process. The bound is obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices generated from rational functions of exponentials of 3 parameters.
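As a rough illustration of the setting only, and not of the paper's algorithm, the sketch below plays one round of the game. It assumes the loss is the inner product of the chosen vertex with the hidden weight vector, and it assumes a standard Plackett-Luce sampler that places items one at a time with probability proportional to the exponential of a score. All function names, the `scores` parameter, and the sample numbers are hypothetical.

```python
import math
import random


def plackett_luce_sample(scores, rng):
    """Draw a ranking of items 0..n-1 (assumed sampler, not the paper's code).

    Items are placed one at a time; each unplaced item is picked with
    probability proportional to exp(score).
    """
    remaining = list(range(len(scores)))
    order = []
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(pick)
        remaining.remove(pick)
    return order


def order_to_vertex(order):
    """Convert a ranking into a permutahedron vertex.

    pi[i] is the 1-based position of item i in the ranking, so pi is a
    rearrangement of (1, ..., n).
    """
    pi = [0] * len(order)
    for position, item in enumerate(order, start=1):
        pi[item] = position
    return pi


def bandit_loss(pi, s):
    """Loss the player observes: inner product of vertex pi and weights s."""
    return sum(p * w for p, w in zip(pi, s))


if __name__ == "__main__":
    rng = random.Random(0)
    hidden_weights = [0.2, 0.9, 0.5]   # chosen by the adversary, never shown
    scores = [0.0, 0.0, 0.0]           # hypothetical player state (uniform)
    vertex = order_to_vertex(plackett_luce_sample(scores, rng))
    # Bandit feedback: the player learns only this scalar, not hidden_weights.
    print(vertex, bandit_loss(vertex, hidden_weights))
```

In the bandit setting the player would have to update its scores from the scalar loss alone; how to do that with controlled variance is the subject of the paper and is not attempted here.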