Coordination without communication: optimal regret in two players multi-armed bandits
Abstract
We consider two agents simultaneously playing the same stochastic three-armed bandit problem. The two agents cooperate but cannot communicate. We propose a strategy that incurs no collisions at all between the players (with very high probability) and achieves near-optimal regret $O(\sqrt{T \log(T)})$. We also argue that the extra logarithmic term $\sqrt{\log(T)}$ should be necessary by proving a lower bound for a full-information variant of the problem.
- Publication:
- arXiv e-prints
- Pub Date:
- February 2020
- DOI:
- 10.48550/arXiv.2002.07596
- arXiv:
- arXiv:2002.07596
- Bibcode:
- 2020arXiv200207596B
- Keywords:
- Computer Science - Computer Science and Game Theory;
- Computer Science - Machine Learning;
- Computer Science - Multiagent Systems;
- Statistics - Machine Learning
- E-Print:
- 28 pages, 5 figures. V2: minor revision