Extensions of the Multi-Armed Bandit Problem
Abstract
There is a class of stochastic control problems whose structure allows one to calculate the optimal policy in a relatively easy way. The "bandit problems" and "tax problems" belong to this class. In the bandit problem there are N independent machines. Machine i is described by a sequence (X^i(s), F^i(s)), s >= 1, where X^i(s) is the immediate reward and F^i(s) is the information available before machine i is operated for the s-th time. At each time one operates exactly one machine; idle machines remain frozen. The problem is to schedule the operation of the machines so as to maximize the expected total discounted reward. An elementary proof shows that to each machine is associated an index, and the optimal policy at every time operates the machine with the largest current index. When the machines are completely observed Markov chains this coincides with the well-known Gittins index rule, and algorithms are given for calculating the index. A reformulation of the bandit problem yields the tax problem. Using the concept of a superprocess, an index rule is derived for the case where new machines arrive randomly. These results have some applications in computer communication networks. We also propose a new sufficient condition for the optimality of one-step look-ahead policies. This condition is used in solving some extensions of the "mailbox" problem and deterministic bandit problems with multiple servers.
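For the completely observed Markov case mentioned above, the Gittins index of a state can be computed numerically. The sketch below uses the restart-in-state formulation due to Katehakis and Veinott (the index of state i equals (1 - beta) times the value, at i, of an MDP in which one may either continue the chain or restart it from i); this is one standard method and is not necessarily the algorithm developed in the thesis. The function name and interface are illustrative assumptions.

```python
import numpy as np

def gittins_indices(P, r, beta, tol=1e-10, max_iter=10000):
    """Gittins indices of a finite-state Markov reward chain.

    Katehakis-Veinott restart formulation (illustrative, not the
    thesis's algorithm): for each state i, solve by value iteration
    the MDP in which, from any state j, one may either continue
    from j or restart the chain in i. The index of i is
    (1 - beta) * V_i(i).

    P    : (n, n) transition matrix
    r    : (n,) immediate reward vector
    beta : discount factor in (0, 1)
    """
    n = len(r)
    nu = np.zeros(n)
    for i in range(n):
        V = np.zeros(n)
        for _ in range(max_iter):
            cont = r + beta * (P @ V)           # keep operating from state j
            restart = r[i] + beta * (P[i] @ V)  # restart the chain in state i
            V_new = np.maximum(cont, restart)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        nu[i] = (1 - beta) * V[i]
    return nu
```

The index rule itself is then immediate: at each time, operate the machine whose current state has the largest index. As a sanity check, when every state is absorbing (P is the identity), the index of state i reduces to its one-step reward r(i).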
Publication: Ph.D. Thesis
Pub Date: 1984
Bibcode: 1984PhDT.......166B
Keywords: SCHEDULING; Physics: Electricity and Magnetism