Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit
Abstract
Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component with classic UCB algorithms to detect and adapt to changes, our so-called M-UCB algorithm can achieve a nearly optimal regret bound on the order of $O(\sqrt{MKT\log T})$, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that our M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with state-of-the-art algorithms in numerical experiments using a public Yahoo! dataset to demonstrate its superior performance.
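The abstract's core idea, running a UCB policy alongside a change detector and restarting the bandit statistics whenever a distribution change is flagged, can be illustrated with a minimal sketch. This is not the paper's exact M-UCB procedure: the two-window detector, its window size `w`, the `threshold`, and the Bernoulli test setup below are simplified stand-ins chosen for illustration.

```python
import math
import random

def change_detected(rewards, w=50, threshold=5.0):
    # Simplified two-window test (a stand-in for the paper's detector):
    # flag a change if the sums of the two halves of the last w
    # observations of an arm differ by more than `threshold`.
    if len(rewards) < w:
        return False
    window = rewards[-w:]
    return abs(sum(window[:w // 2]) - sum(window[w // 2:])) > threshold

def ucb_with_restarts(arm_means_per_segment, segment_len,
                      w=50, threshold=5.0, seed=0):
    """UCB1 that monitors each arm's reward stream and restarts on detection.

    `arm_means_per_segment` lists per-segment Bernoulli means, e.g.
    [[0.9, 0.1], [0.1, 0.9]] (a hypothetical piecewise-stationary setup
    with M = 2 segments and K = 2 arms). Returns total reward collected.
    """
    rng = random.Random(seed)
    K = len(arm_means_per_segment[0])
    counts = [0] * K          # pulls per arm since the last restart
    sums = [0.0] * K          # reward sums per arm since the last restart
    history = [[] for _ in range(K)]  # per-arm reward streams for detection
    t_since_restart = 0
    total = 0.0
    T = segment_len * len(arm_means_per_segment)
    for t in range(T):
        seg = t // segment_len
        t_since_restart += 1
        # Pull each arm once after a restart, then follow the UCB1 index.
        if 0 in counts:
            a = counts.index(0)
        else:
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t_since_restart) / counts[i]))
        r = 1.0 if rng.random() < arm_means_per_segment[seg][a] else 0.0
        total += r
        counts[a] += 1
        sums[a] += r
        history[a].append(r)
        if change_detected(history[a], w, threshold):
            # Restart: discard all statistics and re-explore from scratch.
            counts = [0] * K
            sums = [0.0] * K
            history = [[] for _ in range(K)]
            t_since_restart = 0
    return total
```

The restart-on-detection structure is what yields the segment-wise regret decomposition behind the $O(\sqrt{MKT\log T})$ bound: each of the $M$ stationary segments is handled, after detection, as a fresh stationary bandit instance.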
 Publication:

arXiv e-prints
 Pub Date:
 February 2018
 arXiv:
 arXiv:1802.03692
 Bibcode:
 2018arXiv180203692C
 Keywords:

 Statistics - Machine Learning;
 Computer Science - Machine Learning