Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function
Abstract
We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite stateaction space efficiently. By evaluating the statepair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored. } for MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$ is known. This result outperforms the best previous regret bounds $\tilde{O}(S\sqrt{AHT}) $\citep{fruit2019improved} by a factor of $\sqrt{S}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SAHT}) $\citep{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with a finite diameter $D$ compared to the lower bound of $\Omega(\sqrt{SADT}) $\citep{jaksch2010near}.
 Publication:

arXiv eprints
 Pub Date:
 June 2019
 DOI:
 10.48550/arXiv.1906.05110
 arXiv:
 arXiv:1906.05110
 Bibcode:
 2019arXiv190605110Z
 Keywords:

 Computer Science  Machine Learning;
 Statistics  Machine Learning