
Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback

Haishan Ye
Main: 14 pages
Bibliography: 2 pages
Abstract

We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions while only observing the value of each function at two points. While it is well known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions has remained open, as highlighted by \citet{agarwal2010optimal}. The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge and provide the first high-probability regret bound of $O(d(\log T + \log(1/\delta))/\mu)$ for $\mu$-strongly convex losses. Our result is minimax optimal with respect to both the time horizon $T$ and the dimension $d$.
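As context for the two-point feedback model, a minimal sketch of the standard two-point gradient estimator from this literature (e.g. Agarwal et al., 2010) is shown below; the function name and parameters here are illustrative, not taken from the paper. The estimator queries the loss at two symmetric perturbations of the current point and rescales the difference to obtain an (approximately) unbiased gradient estimate of a smoothed version of the loss.

```python
import numpy as np

def two_point_gradient_estimate(f, x, r=1e-3, rng=None):
    """Standard two-point bandit gradient estimator (illustrative sketch).

    Queries f at x + r*u and x - r*u for a uniformly random unit vector u
    and returns g = (d / (2r)) * (f(x + r*u) - f(x - r*u)) * u, which is an
    unbiased estimate of the gradient of a smoothed version of f.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform direction on the unit sphere
    return (d / (2.0 * r)) * (f(x + r * u) - f(x - r * u)) * u
```

Note that each single estimate has norm up to roughly $d\,\|\nabla f\|$, so its distribution is heavy-tailed in high dimensions; this is the concentration difficulty the abstract refers to, even though the estimator is unbiased in expectation.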
