Bandit Convex Optimization in Nonstationary Environments
Abstract
Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in non-stationary environments and choose the \emph{dynamic regret} as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence that reflects the non-stationarity of environments. We propose a novel algorithm that achieves $O(T^{3/4}(1+P_T)^{1/2})$ and $O(T^{1/2}(1+P_T)^{1/2})$ dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the $\Omega(T^{1/2}(1+P_T)^{1/2})$ lower bound established in this paper. Notably, our algorithm is more adaptive to non-stationary environments since it does not require prior knowledge of the path-length $P_T$, which is generally unknown ahead of time.
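For intuition, the one-point and two-point feedback models referenced in the abstract are usually handled with spherical gradient estimators in the BCO literature. The sketch below is illustrative only, not the paper's algorithm: function names and the smoothing radius `delta` are assumptions, and the estimators shown are the standard Flaxman-style constructions that such algorithms typically plug into online gradient descent.

```python
import numpy as np

def sphere_sample(d, rng):
    # Draw u uniformly from the unit sphere in R^d by
    # normalizing a standard Gaussian vector.
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)

def one_point_grad(f, x, delta, rng):
    # One-point feedback: a single function value at a perturbed
    # point.  (d/delta) * f(x + delta*u) * u is an unbiased
    # estimate of the gradient of a smoothed version of f.
    d = x.shape[0]
    u = sphere_sample(d, rng)
    return (d / delta) * f(x + delta * u) * u

def two_point_grad(f, x, delta, rng):
    # Two-point feedback: two symmetric queries.  The difference
    # cancels the function-value term, which sharply reduces
    # variance and underlies the improved O(T^{1/2}) rates.
    d = x.shape[0]
    u = sphere_sample(d, rng)
    return (d / (2 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
```

For a linear loss $f(x) = c^\top x$, the two-point estimator is unbiased for $c$ regardless of `delta`, so averaging many estimates recovers the true gradient; in general both estimators trade bias (via `delta`) against variance.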
Publication: arXiv e-prints
Pub Date: July 2019
DOI: 10.48550/arXiv.1907.12340
arXiv: arXiv:1907.12340
Bibcode: 2019arXiv190712340Z
Keywords: Computer Science - Machine Learning; Statistics - Machine Learning
E-Print: AISTATS 2020