390

Actively Tracking the Optimal Arm in Non-Stationary Environments with Mandatory Probing

IEEE Transactions on Communications (IEEE Trans. Commun.), 2022
Main:25 Pages
5 Figures
Bibliography:1 Pages
2 Tables
Appendix:4 Pages
Abstract

We study a novel multi-armed bandit (MAB) setting which mandates the agent to probe all the arms periodically in a non-stationary environment. In particular, we develop \texttt{TS-GE} that balances the regret guarantees of classical Thompson sampling (TS) with the broadcast probing (BP) of all the arms simultaneously in order to actively detect a change in the reward distributions. Once a system-level change is detected, the changed arm is identified by an optional subroutine called group exploration (GE) which scales as log2(K)\log_2(K) for a KK-armed bandit setting. We characterize the probability of missed detection and the probability of false-alarm in terms of the environment parameters. The latency of change-detection is upper bounded by T\sqrt{T} while within a period of T\sqrt{T}, all the arms are probed at least once. We highlight the conditions in which the regret guarantee of \texttt{TS-GE} outperforms that of the state-of-the-art algorithms, in particular, \texttt{ADSWITCH} and \texttt{M-UCB}. Furthermore, unlike the existing bandit algorithms, \texttt{TS-GE} can be deployed for applications such as timely status updates, critical control, and wireless energy transfer, which are essential features of next-generation wireless communication networks. We demonstrate the efficacy of \texttt{TS-GE} by employing it in a n industrial internet-of-things (IIoT) network designed for simultaneous wireless information and power transfer (SWIPT).

View on arXiv
Comments on this paper