Regret Bounds for Discounted MDPs

doi:10.48550/arXiv.2002.05138

Regret Bounds for Discounted MDPs

Reinforcement learning (RL) has traditionally been understood from an episodic perspective; the concept of non-episodic RL, where there is no restart and therefore no reliable recovery, remains elusive. A fundamental question in non-episodic RL is how to measure the performance of a learner and derive algorithms to maximize such performance. Conventional wisdom is to maximize the difference between the average reward received by the learner and the maximal long-term average reward. In this paper, we argue that if the total time budget is relatively limited compared to the complexity of the environment, such comparison may fail to reflect the finite-time optimality of the learner. We propose a family of measures, called $\gamma$-regret, which we believe to better capture the finite-time optimality. We give motivations and derive lower and upper bounds for such measures. Note: A follow-up work (arXiv:2010.00587) has improved both our lower and upper bound, the gap is now closed at $\tilde{\Theta}\left(\frac{\sqrt{SAT}}{(1 - \gamma)^{\frac{1}{2}}}\right)$.

Publication:

arXiv e-prints

Pub Date:

February 2020

DOI:

10.48550/arXiv.2002.05138

arXiv:

arXiv:2002.05138

Bibcode:

2020arXiv200205138L

Keywords:

Computer Science - Machine Learning;
Statistics - Machine Learning

ADS

Regret Bounds for Discounted MDPs

Abstract