Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

doi:10.48550/arXiv.2304.04341

Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

We study the trade-off between expectation and tail risk for regret distribution in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties for policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of expected regret exactly affects the decaying rate of the regret tail probability for both the worst-case and instance-dependent scenario. A novel policy is proposed to characterize the optimal regret tail probability for any regret threshold. Concretely, for any given $\alpha\in[1/2, 1)$ and $\beta\in[0, \alpha]$, our policy achieves a worst-case expected regret of $\tilde O(T^\alpha)$ (we call it $\alpha$-optimal) and an instance-dependent expected regret of $\tilde O(T^\beta)$ (we call it $\beta$-consistent), while enjoys a probability of incurring an $\tilde O(T^\delta)$ regret ($\delta\geq\alpha$ in the worst-case scenario and $\delta\geq\beta$ in the instance-dependent scenario) that decays exponentially with a polynomial $T$ term. Such decaying rate is proved to be best achievable. Moreover, we discover an intrinsic gap of the optimal tail rate under the instance-dependent scenario between whether the time horizon $T$ is known a priori or not. Interestingly, when it comes to the worst-case scenario, this gap disappears. Finally, we extend our proposed policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards, and (2) a stochastic linear bandit setting. Our results reveal insights on the trade-off between regret expectation and regret tail risk for both worst-case and instance-dependent scenarios, indicating that more sub-optimality and inconsistency leave space for more light-tailed risk of incurring a large regret, and that knowing the planning horizon in advance can make a difference on alleviating tail risks.

Publication:

arXiv e-prints

Pub Date:

April 2023

DOI:

10.48550/arXiv.2304.04341

arXiv:

arXiv:2304.04341

Bibcode:

2023arXiv230404341S

Keywords:

Statistics - Machine Learning;
Computer Science - Machine Learning;
Mathematics - Statistics Theory;
Statistics - Methodology

E-Print:

arXiv admin note: text overlap with arXiv:2206.02969

ADS

Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

Abstract