An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits
Abstract
In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to a regret that is logarithmic with respect to the number of arm pulls.
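As a rough illustration of the annealed exploration described in the abstract, the sketch below runs soft-max (Boltzmann) arm selection on a Bernoulli bandit while an inverse-temperature parameter grows with the number of pulls. It is only a sketch under stated assumptions, not the authors' update rule: the names `softmax_probs` and `run_bandit`, the Bernoulli reward model, and the `log(t + 1)` schedule are all hypothetical choices made for this example, and the code makes no claim to reproduce the paper's regret bound.

```python
import math
import random


def softmax_probs(values, beta):
    """Soft-max distribution over arms; beta is the inverse temperature."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def run_bandit(arm_means, num_pulls, seed=0):
    """Simulate annealed soft-max exploration on a Bernoulli bandit.

    arm_means: true success probability of each arm (assumed reward model).
    Returns per-arm pull counts, mean-reward estimates, and cumulative
    expected regret relative to always pulling the best arm.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    estimates = [0.0] * k
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, num_pulls + 1):
        # Hypothetical cooling schedule: a small beta early on gives nearly
        # uniform exploration, a larger beta later favors exploitation.
        beta = math.log(t + 1)
        probs = softmax_probs(estimates, beta)
        arm = rng.choices(range(k), weights=probs)[0]

        # Draw a Bernoulli reward from the chosen arm.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0

        # Incremental update of the running mean for the chosen arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        regret += best_mean - arm_means[arm]

    return counts, estimates, regret


if __name__ == "__main__":
    counts, estimates, regret = run_bandit([0.2, 0.5, 0.8], num_pulls=2000)
    print("pull counts:", counts)
    print("estimates:", [round(e, 3) for e in estimates])
    print("cumulative expected regret:", round(regret, 2))
```

Swapping in a different expression for `beta` is the simplest way to see how the speed of the schedule shifts the balance between exploration and exploitation.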
- Publication: Entropy
- Pub Date: February 2018
- DOI:
- arXiv: arXiv:1710.02869
- Bibcode: 2018Entrp..20..155S
- Keywords: multi-armed bandits; exploration; exploitation; exploration-exploitation dilemma; reinforcement learning; information theory; Computer Science - Artificial Intelligence; Computer Science - Machine Learning; Statistics - Machine Learning
- E-Print: Entropy