TS-UCB: Improving on Thompson Sampling With Little to No Additional Computation
Abstract
Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule we dub TS-UCB, that requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. TS-UCB achieves materially lower regret on a comprehensive suite of synthetic and real-world datasets, including a personalized article recommendation dataset from Yahoo! and a suite of benchmark datasets from a deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.
- Publication:
-
arXiv e-prints
- Pub Date:
- June 2020
- DOI:
- 10.48550/arXiv.2006.06372
- arXiv:
- arXiv:2006.06372
- Bibcode:
- 2020arXiv200606372B
- Keywords:
-
- Computer Science - Machine Learning;
- Statistics - Machine Learning