Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs
Abstract
Q-learning is a popular Reinforcement Learning (RL) algorithm which is widely used in practice with function approximation (Mnih et al., 2015). In contrast, existing theoretical results are pessimistic about Q-learning. For example, (Baird, 1995) shows that Q-learning does not converge even with linear function approximation for linear MDPs. Furthermore, even for tabular MDPs with synchronous updates, Q-learning was shown to have suboptimal sample complexity (Li et al., 2021; Azar et al., 2013). The goal of this work is to bridge the gap between the practical success of Q-learning and the relatively pessimistic theoretical results. The starting point of our work is the observation that in practice, Q-learning is used with two important modifications: (i) training with two networks, called the online network and the target network, simultaneously (online target learning, or OTL), and (ii) experience replay (ER) (Mnih et al., 2015). While these have been observed to play a significant role in the practical success of Q-learning, a thorough theoretical understanding of how these two modifications improve the convergence behavior of Q-learning has been missing in the literature. By carefully combining Q-learning with OTL and reverse experience replay (RER), a form of experience replay, we present novel methods Q-Rex and Q-RexDaRe (Q-Rex + data reuse). We show that Q-Rex efficiently finds the optimal policy for linear MDPs (or more generally for MDPs with zero inherent Bellman error with linear approximation (ZIBEL)) and provide non-asymptotic bounds on sample complexity: the first such result for a Q-learning method for this class of MDPs under standard assumptions. Furthermore, we demonstrate that Q-RexDaRe in fact achieves near-optimal sample complexity in the tabular setting, improving upon the existing results for vanilla Q-learning.
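To make the two modifications concrete, the following is a minimal tabular sketch of the ideas the abstract describes: an online/target pair of Q-tables (OTL), with consecutive transitions gathered into a buffer and replayed in reverse order (RER). This is an illustrative assumption-laden sketch, not the paper's exact Q-Rex algorithm; the function names, hyperparameters, and the `env_step` interface are hypothetical.

```python
import random
import numpy as np

def q_rex_sketch(env_step, n_states, n_actions, gamma=0.99, lr=0.1,
                 n_buffers=50, buffer_size=20, target_sync=5):
    """Illustrative sketch of Q-learning with online-target learning (OTL)
    and reverse experience replay (RER). Not the paper's exact Q-Rex."""
    Q_online = np.zeros((n_states, n_actions))
    Q_target = np.zeros((n_states, n_actions))
    s = 0
    for b in range(n_buffers):
        # Collect a buffer of consecutive transitions under an
        # exploratory (uniformly random) behavior policy.
        buffer = []
        for _ in range(buffer_size):
            a = random.randrange(n_actions)
            s_next, r = env_step(s, a)
            buffer.append((s, a, r, s_next))
            s = s_next
        # RER: replay the buffer in reverse (newest transition first),
        # bootstrapping targets from the frozen target table.
        for (st, at, rt, st1) in reversed(buffer):
            td_target = rt + gamma * Q_target[st1].max()
            Q_online[st, at] += lr * (td_target - Q_online[st, at])
        # OTL: periodically copy the online table into the target table.
        if (b + 1) % target_sync == 0:
            Q_target = Q_online.copy()
    return Q_online
```

Replaying in reverse lets reward information propagate backward through a trajectory within a single pass, while the frozen target table keeps the bootstrap targets stable between syncs; these are the two effects the paper's analysis formalizes.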
Publication: arXiv e-prints
Pub Date: October 2021
arXiv: arXiv:2110.08440
Bibcode: 2021arXiv211008440A
Keywords: Computer Science - Machine Learning; Mathematics - Optimization and Control
E-Print: Under Review; v2 has updated acknowledgements