First- and Second-Order Bounds for Adversarial Linear Contextual Bandits
Abstract
We consider the adversarial linear contextual bandit setting, which allows for the loss functions associated with each of $K$ arms to change over time without restriction. Assuming the $d$-dimensional contexts are drawn from a fixed known distribution, the worst-case expected regret over the course of $T$ rounds is known to scale as $\tilde O(\sqrt{Kd T})$. Under the additional assumption that the density of the contexts is log-concave, we obtain a second-order bound of order $\tilde O(K\sqrt{d V_T})$ in terms of the cumulative second moment of the learner's losses $V_T$, and a closely related first-order bound of order $\tilde O(K\sqrt{d L_T^*})$ in terms of the cumulative loss of the best policy $L_T^*$. Since $V_T$ or $L_T^*$ may be significantly smaller than $T$, these improve over the worst-case regret whenever the environment is relatively benign. Our results are obtained using a truncated version of the continuous exponential weights algorithm over the probability simplex, which we analyse by exploiting a novel connection to the linear bandit setting without contexts.
 Publication:

arXiv e-prints
 Pub Date:
 May 2023
 DOI:
 10.48550/arXiv.2305.00832
 arXiv:
 arXiv:2305.00832
 Bibcode:
 2023arXiv230500832O
 Keywords:

 Computer Science - Machine Learning;
 Statistics - Machine Learning