$Q$-learning with Logarithmic Regret
Abstract
This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive suboptimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning algorithm studied in [Jin et al. 2018] enjoys an $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\mathrm{gap}_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\mathrm{gap}_{\min}$ is the minimum suboptimality gap. This bound matches the information-theoretic lower bound in terms of $S, A, T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
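The optimistic $Q$-learning algorithm referenced above (the UCB-Hoeffding variant of Jin et al. 2018) can be sketched in a few lines: maintain optimistically initialized $Q$-estimates, act greedily, and update with a rescaled learning rate $\alpha_t = \frac{H+1}{H+t}$ plus a Hoeffding-style exploration bonus of order $\sqrt{H^3\iota/t}$. The sketch below is illustrative, not the paper's exact pseudocode; the toy environment interface (`step`, `reward`) and the bonus constant `c` are assumptions made for this example.

```python
import numpy as np

def optimistic_q_learning(S, A, H, K, step, reward, seed=0, c=1.0):
    """Sketch of tabular optimistic Q-learning with a Hoeffding-style bonus.

    step(h, s, a, rng) -> next state; reward(h, s, a) -> reward in [0, 1].
    K is the number of episodes; the interface is illustrative only.
    """
    rng = np.random.default_rng(seed)
    Q = np.full((H, S, A), float(H))       # optimistic initialization Q = H
    V = np.zeros((H + 1, S))               # V[H] = 0 terminal values
    N = np.zeros((H, S, A), dtype=int)     # visit counts per (h, s, a)
    iota = np.log(S * A * H * K + 1)       # log factor in the bonus
    total_return = 0.0
    for k in range(K):
        s = 0                              # fixed initial state for the toy setup
        for h in range(H):
            a = int(np.argmax(Q[h, s]))    # greedy w.r.t. optimistic Q
            r = reward(h, s, a)
            s_next = step(h, s, a, rng)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)      # rescaled linear learning rate
            bonus = c * np.sqrt(H**3 * iota / t)  # exploration bonus ~ sqrt(H^3 iota / t)
            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(float(H), Q[h, s].max())  # clipped value estimate
            total_return += r
            s = s_next
    return Q, total_return
```

Under a positive minimum suboptimality gap, the paper's analysis shows the regret of this scheme grows only logarithmically in $T = KH$, rather than as $\sqrt{T}$.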
Publication: arXiv e-prints
Pub Date: June 2020
arXiv: arXiv:2006.09118
Bibcode: 2020arXiv200609118Y
Keywords: Computer Science - Machine Learning; Mathematics - Optimization and Control; Statistics - Machine Learning