We study the problem of identifying the best arm in a stochastic multi-armed bandit game. Given a set of $n$ arms indexed from $1$ to $n$, each arm $i$ is associated with an unknown reward distribution supported on $[0,1]$ with mean $\theta_i$ and variance $\sigma_i^2$. Assume $\theta_1 > \theta_2 \ge \cdots \ge \theta_n$. We propose an adaptive algorithm which explores the gaps and variances of the rewards of the arms and makes future decisions based on the gathered information using a novel approach called \textit{grouped median elimination}. The proposed algorithm guarantees to output the best arm with probability $(1 - \delta)$ and uses at most $O\left(\sum_{i=1}^{n} \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)\left(\ln \delta^{-1} + \ln \ln \Delta_i^{-1}\right)\right)$ samples, where $\Delta_i$ ($i \ge 2$) denotes the reward gap between arm $i$ and the best arm and we define $\Delta_1 = \Delta_2$. This achieves a significant advantage over the variance-independent algorithms in some favorable scenarios and is the first result that removes the extra $\ln n$ factor on the best arm compared with the state-of-the-art. We further show that $\Omega\left(\sum_{i=2}^{n} \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)\ln \delta^{-1}\right)$ samples are necessary for an algorithm to achieve the same goal, thereby illustrating that our algorithm is optimal up to doubly logarithmic terms.
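To make the elimination idea concrete, below is a minimal sketch of \textit{classic} median elimination (in the spirit of Even-Dar et al.), the building block that the proposed grouped, variance-adaptive variant refines; it is not the paper's algorithm. The sampling oracle \texttt{pull}, the schedule of per-round parameters, and all identifiers are assumptions made for illustration only.

\begin{verbatim}
import math
import random

def median_elimination(pull, n, eps=0.1, delta=0.05):
    """Return the index of an eps-optimal arm with probability >= 1 - delta.

    pull(i) draws one [0, 1]-bounded reward from arm i.
    (Illustrative sketch; the per-round schedule below is the classic one,
    not the grouped, variance-adaptive schedule from the paper.)
    """
    arms = list(range(n))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        # Sample every surviving arm often enough to estimate its mean to
        # within eps_l / 2 with probability 1 - delta_l (Hoeffding bound).
        t = int(math.ceil(4.0 / (eps_l ** 2) * math.log(3.0 / delta_l)))
        means = {i: sum(pull(i) for _ in range(t)) / t for i in arms}
        # Keep the better half of the arms by empirical mean.
        arms.sort(key=lambda i: means[i], reverse=True)
        arms = arms[: (len(arms) + 1) // 2]
        # Tighten the accuracy and confidence for the next round.
        eps_l *= 3.0 / 4.0
        delta_l /= 2.0
    return arms[0]

# Toy usage: three Bernoulli arms with means 0.5, 0.6, 0.8.
if __name__ == "__main__":
    theta = [0.5, 0.6, 0.8]
    best = median_elimination(lambda i: float(random.random() < theta[i]),
                              n=len(theta))
    print("identified arm:", best)
\end{verbatim}

The classic schedule spends the same budget on every surviving arm; the variance-dependent bound above is obtained precisely by departing from this, allocating samples according to the observed gaps and variances.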