Thompson Sampling for Complex Bandit Problems
Abstract
We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms' rewards, and the feedback observed may not necessarily be the reward per arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space and without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first non-trivial regret bounds for nonlinear MAX reward feedback from subsets.
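To make the setting concrete, here is a minimal sketch (not the paper's exact algorithm or analysis) of Thompson sampling with a discretely-supported prior on a hypothetical toy instance: three Bernoulli basic arms, complex actions that are 2-element subsets, and only the MAX reward over the chosen subset observed each round. The true arm means, grid, and horizon below are illustrative assumptions.

```python
import itertools
import random

random.seed(0)

# Hypothetical toy instance (not from the paper): 3 Bernoulli basic arms,
# complex actions are 2-element subsets, and the only feedback per round
# is the MAX reward over the chosen subset.
TRUE_P = [0.2, 0.5, 0.8]          # unknown to the learner
GRID = [0.1, 0.3, 0.5, 0.7, 0.9]  # discretely-supported parameter space
ACTIONS = list(itertools.combinations(range(3), 2))

# Uniform prior over all parameter vectors on the grid; no conjugacy or
# independence across arms is assumed -- the posterior is kept explicitly.
particles = list(itertools.product(GRID, repeat=3))
weights = [1.0 / len(particles)] * len(particles)

def max_likelihood(theta, subset, y):
    """Likelihood of observing max reward y over `subset` under `theta`."""
    p_all_zero = 1.0
    for i in subset:
        p_all_zero *= 1.0 - theta[i]
    return p_all_zero if y == 0 else 1.0 - p_all_zero

counts = {a: 0 for a in ACTIONS}
for t in range(2000):
    # 1. Sample a parameter vector from the current posterior.
    theta = random.choices(particles, weights=weights)[0]
    # 2. Play the complex action that is optimal under the sampled parameter:
    #    expected MAX reward is 1 - P(all arms in the subset give 0).
    action = max(ACTIONS, key=lambda s: 1.0 - max_likelihood(theta, s, 0))
    counts[action] += 1
    # 3. Observe only the coupled MAX feedback, not the per-arm rewards.
    y = max(int(random.random() < TRUE_P[i]) for i in action)
    # 4. Bayes update: reweight every particle by the observation likelihood.
    weights = [w * max_likelihood(p, action, y) for p, w in zip(particles, weights)]
    total = sum(weights)
    weights = [w / total for w in weights]

print(counts)  # play counts per subset; under TRUE_P, (1, 2) is optimal
```

Note how a single MAX observation updates the posterior over entire parameter vectors at once, which is the coupling across complex actions that the paper's regret constant captures.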
 Publication:

arXiv e-prints
 Pub Date:
 November 2013
 DOI:
 10.48550/arXiv.1311.0466
 arXiv:
 arXiv:1311.0466
 Bibcode:
 2013arXiv1311.0466G
 Keywords:

 Statistics - Machine Learning;
 Computer Science - Machine Learning