Multi-armed Bandits with Compensation

doi:10.48550/arXiv.1811.01715

We propose and study the known-compensation multi-armed bandit (KCMAB) problem, where a system controller offers a set of arms to many short-term players for $T$ steps. In each step, one short-term player arrives at the system. Upon arrival, the player aims to select the arm with the current best average reward and receives a stochastic reward associated with that arm. To incentivize players to explore other arms, the controller pays them compensation. The objective of the controller is to maximize the total reward collected by players while minimizing the total compensation. We first provide a compensation lower bound $\Theta(\sum_i \frac{\Delta_i \log T}{KL_i})$, where $\Delta_i$ and $KL_i$ are the expected reward gap and the Kullback-Leibler (KL) divergence between the distributions of arm $i$ and the best arm, respectively. We then analyze three algorithms for the KCMAB problem and derive their regrets and compensations. We show that all three algorithms achieve $O(\log T)$ regret and $O(\log T)$ compensation, matching the theoretical lower bound. Finally, we present experimental results to demonstrate the performance of the algorithms.
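The setting described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not one of the three algorithms analyzed in the paper: it assumes Bernoulli arms, uses a standard UCB1 index to decide which arm the controller wants pulled, and assumes the compensation in a step equals the gap between the highest empirical mean and the empirical mean of the arm actually pulled. The function name `simulate_kcmab_ucb` and its parameters are hypothetical, and the initial round-robin pulls are assumed to cost no compensation.

```python
import math
import random


def simulate_kcmab_ucb(means, horizon, seed=0):
    """Simulate a compensated bandit with a UCB1-style controller.

    Assumptions (not taken from the abstract): Bernoulli rewards, a UCB1
    exploration bonus, and per-step compensation equal to the empirical-mean
    gap between the greedy arm and the arm the controller recommends.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    total_regret = 0.0
    total_compensation = 0.0
    best_mean = max(means)

    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: pull every arm once (assumed free of compensation).
            arm = t - 1
        else:
            empirical = [sums[i] / counts[i] for i in range(k)]
            ucb = [
                empirical[i] + math.sqrt(2.0 * math.log(t) / counts[i])
                for i in range(k)
            ]
            arm = max(range(k), key=lambda i: ucb[i])
            # A myopic player would pick the arm with the best empirical mean.
            greedy = max(range(k), key=lambda i: empirical[i])
            # Pay the player the perceived loss from following the controller.
            total_compensation += empirical[greedy] - empirical[arm]

        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_regret += best_mean - means[arm]

    return total_regret, total_compensation, counts


if __name__ == "__main__":
    regret, compensation, pulls = simulate_kcmab_ucb([0.9, 0.5, 0.3], horizon=2000)
    print("regret:", regret)
    print("compensation:", compensation)
    print("pulls per arm:", pulls)
```

Running the sketch with a growing `horizon` gives a rough feel for how regret and compensation scale together; it makes no claim to reproduce the paper's experimental results.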

Publication: arXiv e-prints

Pub Date: November 2018

DOI: 10.48550/arXiv.1811.01715

arXiv: arXiv:1811.01715

Bibcode: 2018arXiv181101715W

Keywords: Computer Science - Machine Learning; Statistics - Machine Learning
