Randomized Exploration for Non-Stationary Stochastic Linear Bandits
Abstract
We investigate two perturbation approaches to overcome the conservatism that optimism-based algorithms chronically suffer from in practice. The first approach replaces optimism with simple randomization when using confidence sets. The second adds random perturbations to the current estimate before maximizing the expected reward. For non-stationary linear bandits, where each action is associated with a $d$-dimensional feature vector and the unknown parameter is time-varying with total variation $B_T$, we propose two randomized algorithms, Discounted Randomized LinUCB (D-RandLinUCB) and Discounted Linear Thompson Sampling (D-LinTS), via these two perturbation approaches. We highlight the trade-off between statistical optimality and computational efficiency: the former asymptotically achieves the optimal dynamic regret $\tilde{O}(d^{7/8} B_T^{1/4} T^{3/4})$, while the latter is oracle-efficient at the cost of an extra logarithmic factor in the number of arms relative to the minimax-optimal dynamic regret. In a simulation study, both algorithms show outstanding performance in tackling the conservatism issue that Discounted LinUCB struggles with.
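As a rough illustration of the second perturbation approach, the Python sketch below implements one plausible discounted linear Thompson Sampling loop in the spirit of D-LinTS. The update rule, the hyperparameters (`gamma`, `lam`, `nu`), and the `features`/`reward` callables are assumptions made for illustration, not the paper's exact algorithm.

```python
import numpy as np

def d_lints(features, reward, T, d, gamma=0.99, lam=1.0, nu=1.0, seed=0):
    """Sketch of a Discounted Linear Thompson Sampling (D-LinTS) loop.

    `features(t)` should return an (n_arms, d) array of arm features and
    `reward(t, a)` the observed reward of arm `a`; both are hypothetical
    callables supplied by the caller.  gamma (discount), lam (ridge
    regularization), and nu (perturbation scale) are illustrative values.
    """
    rng = np.random.default_rng(seed)
    V = lam * np.eye(d)              # discounted, regularized Gram matrix
    b = np.zeros(d)                  # discounted feature-weighted rewards
    total = 0.0
    for t in range(T):
        X = features(t)
        theta_hat = np.linalg.solve(V, b)        # ridge estimate
        # Perturb the estimate instead of building an optimistic bonus:
        theta_tilde = rng.multivariate_normal(theta_hat,
                                              nu ** 2 * np.linalg.inv(V))
        a = int(np.argmax(X @ theta_tilde))      # greedy on perturbed estimate
        r = reward(t, a)
        total += r
        # Discount old data so the estimate can track a drifting parameter;
        # the extra (1 - gamma) * lam term keeps the regularizer from decaying.
        V = gamma * V + np.outer(X[a], X[a]) + (1.0 - gamma) * lam * np.eye(d)
        b = gamma * b + r * X[a]
    return total
```

Note that each round only requires an argmax over the arms' estimated rewards, which reflects the sense in which the perturbation-based approach is oracle-efficient.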
- Publication: arXiv e-prints
- Pub Date: December 2019
- DOI: 10.48550/arXiv.1912.05695
- arXiv: arXiv:1912.05695
- Bibcode: 2019arXiv191205695K
- Keywords: Statistics - Machine Learning; Computer Science - Machine Learning
- E-Print: An earlier version of this manuscript presented two perturbation-based algorithms and their dynamic regret upper bounds. The argument contained a technical mistake, and the current version presents a fix that weakens their dynamic regret bounds from $\tilde{O}(T^{2/3})$ to $\tilde{O}(T^{3/4})$.