Risk-Averse Multi-Armed Bandit Problems Under Mean-Variance Measure
Abstract
Multi-armed bandit (MAB) problems have been studied mainly under the measure of expected total reward accrued over a horizon of length $T$. In this paper, we address the issue of risk in MAB problems and develop parallel results under the mean-variance measure, a risk measure commonly adopted in economics and mathematical finance. We show that the model-specific regret and the model-independent regret, both defined in terms of the mean-variance of the reward process, are lower bounded by $\Omega(\log T)$ and $\Omega(T^{2/3})$, respectively. We then show that variations of the UCB and DSEE policies, originally developed for the classic risk-neutral MAB, achieve these lower bounds.
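To make the mean-variance criterion concrete, the following is a minimal sketch and not the paper's exact policy. It assumes the common definition in which the mean-variance of a reward is its variance minus a risk-tolerance factor `rho` times its mean, so smaller values are better; the abstract itself does not state this formula. The function names, the exploration constant `b`, and the Gaussian test arms are all hypothetical choices made for illustration.

```python
import math
import random


def empirical_mean_variance(rewards, rho):
    """Sample variance minus rho times sample mean (assumed definition; lower is better)."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    return variance - rho * mean


def mv_lcb_index(rewards, t, rho, b):
    """Optimistic index: empirical mean-variance minus a confidence width.

    The width b * sqrt(log t / n) is an illustrative choice, not a tuned constant.
    """
    width = b * math.sqrt(math.log(t) / len(rewards))
    return empirical_mean_variance(rewards, rho) - width


def run_mv_lcb(arms, horizon, rho=1.0, b=1.0, seed=0):
    """Play each arm once, then always pull the arm with the smallest index."""
    rng = random.Random(seed)
    history = [[] for _ in arms]
    for t in range(1, horizon + 1):
        if t <= len(arms):
            k = t - 1  # initial round-robin so every arm has one sample
        else:
            k = min(range(len(arms)),
                    key=lambda i: mv_lcb_index(history[i], t, rho, b))
        history[k].append(arms[k](rng))
    return history


# Hypothetical arms: arm 0 has a lower mean but far lower variance than arm 1.
arms = [
    lambda rng: rng.gauss(1.0, 0.1),  # mean-variance with rho=1: 0.01 - 1.0
    lambda rng: rng.gauss(1.2, 2.0),  # mean-variance with rho=1: 4.0 - 1.2
]
history = run_mv_lcb(arms, horizon=2000)
pulls = [len(h) for h in history]
```

Under these assumptions, a risk-neutral policy would favor arm 1 for its higher mean, whereas the mean-variance index favors arm 0 because the variance penalty dominates. Raising `rho` shifts the preference back toward the higher-mean arm.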
- Publication:
- IEEE Journal of Selected Topics in Signal Processing
- Pub Date:
- September 2016
- DOI:
- 10.1109/JSTSP.2016.2592622
- arXiv:
- arXiv:1604.05257
- Bibcode:
- 2016ISTSP..10.1093V
- Keywords:
- Computer Science - Machine Learning
- E-Print:
- doi:10.1109/JSTSP.2016.2592622