Thompson sampling for linear quadratic mean-field teams
Abstract
We consider optimal control of an unknown multi-agent linear quadratic (LQ) system in which the dynamics and the cost are coupled across the agents through the mean-field (i.e., the empirical mean) of the states and controls. Directly applying single-agent LQ learning algorithms to such models results in regret that grows polynomially with the number of agents. We propose a new Thompson-sampling-based learning algorithm that exploits the structure of the system model, and we show that its expected Bayesian regret for a system with agents of $|M|$ different types at time horizon $T$ is $\tilde{\mathcal{O}} \big( |M|^{1.5} \sqrt{T} \big)$, irrespective of the total number of agents, where the $\tilde{\mathcal{O}}$ notation hides logarithmic factors in $T$. We present detailed numerical experiments to illustrate the salient features of the proposed algorithm.
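To make the two quantities in the abstract concrete, here is a minimal Python sketch. It is an illustration only, not the paper's algorithm. It assumes the mean-field is the empirical mean of the states taken over the agents of each type, and the function names `empirical_mean_field` and `regret_scaling` are hypothetical. The second function evaluates only the leading term $|M|^{1.5} \sqrt{T}$ of the stated bound, with constants and logarithmic factors dropped.

```python
import numpy as np

def empirical_mean_field(states, types):
    """Empirical mean of the agent states, computed per type (assumed layout).

    states: array of shape (num_agents, state_dim)
    types:  length-num_agents sequence of type labels
    Returns a dict mapping each type label to its mean state vector.
    """
    states = np.asarray(states, dtype=float)
    types = np.asarray(types)
    return {m: states[types == m].mean(axis=0)
            for m in np.unique(types).tolist()}

def regret_scaling(num_types, horizon):
    """Leading term |M|^1.5 * sqrt(T) of the bound (no constants, no logs)."""
    return num_types ** 1.5 * np.sqrt(horizon)

# Three agents with two-dimensional states; the first two share a type.
states = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
types = [0, 0, 1]
mf = empirical_mean_field(states, types)

# The leading term depends on the number of types, not the number of agents.
scale = regret_scaling(num_types=4, horizon=100)
```

Note that `regret_scaling` takes no agent-count argument, which mirrors the abstract's claim that the bound holds irrespective of the total number of agents.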
- Publication: arXiv e-prints
- Pub Date: November 2020
- DOI: 10.48550/arXiv.2011.04686
- arXiv: arXiv:2011.04686
- Bibcode: 2020arXiv201104686G
- Keywords: Electrical Engineering and Systems Science - Systems and Control; Computer Science - Machine Learning; Mathematics - Optimization and Control
- E-Print: Submitted to AISTATS 2021