Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
Abstract
Q-learning suffers from overestimation bias, because it approximates the maximum action value using the maximum estimated action value. Algorithms have been proposed to reduce overestimation bias, but we lack an understanding of how bias interacts with performance, and the extent to which existing algorithms mitigate bias. In this paper, we 1) highlight that the effect of overestimation bias on learning efficiency is environment-dependent; 2) propose a generalization of Q-learning, called \emph{Maxmin Q-learning}, which provides a parameter to flexibly control bias; 3) show theoretically that there exists a parameter choice for Maxmin Q-learning that leads to unbiased estimation with a lower approximation variance than Q-learning; and 4) prove the convergence of our algorithm in the tabular case, as well as convergence of several previous Q-learning variants, using a novel Generalized Q-learning framework. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
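The core idea in the abstract, maintaining N action-value estimates and bootstrapping from the elementwise minimum over them, can be sketched as a tabular update. This is a minimal illustration, not the paper's reference implementation; the function name, array shapes, and the choice to update a single randomly selected estimator per step are assumptions for the sketch.

```python
import random
import numpy as np

def maxmin_q_update(Q_tables, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Maxmin Q-learning update (illustrative sketch).

    Q_tables: list of N arrays, each of shape [num_states, num_actions].
    The bootstrap target uses Q_min(s', a') = min_i Q_i(s', a'),
    maximized over a'. With N = 1 this reduces to standard Q-learning;
    larger N pushes the estimate lower, controlling overestimation bias.
    """
    # Elementwise minimum across the N estimators: shape [S, A]
    q_min = np.min(np.stack(Q_tables), axis=0)
    target = r + gamma * np.max(q_min[s_next])
    # Update one randomly chosen estimator toward the shared target
    i = random.randrange(len(Q_tables))
    Q_tables[i][s, a] += alpha * (target - Q_tables[i][s, a])
    return Q_tables
```

The bias-control parameter referred to in the abstract is N, the number of estimators: taking the minimum over more independent estimates shifts the target estimate downward, trading overestimation for underestimation.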
 Publication:

arXiv e-prints
 Pub Date:
 February 2020
 arXiv:
 arXiv:2002.06487
 Bibcode:
 2020arXiv200206487L
 Keywords:

 Computer Science - Machine Learning;
 Computer Science - Artificial Intelligence