Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

doi:10.48550/arXiv.2010.03531

In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of $\Omega((H^3SA/\varepsilon^2)\log(1/\delta))$ on the sample complexity of an $(\varepsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP. This lower bound relies on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the $\Omega(\sqrt{H^3SAT})$ regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.
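To get a feel for how the two bounds in the abstract scale, the following minimal sketch evaluates the expressions inside the $\Omega(\cdot)$ with all constants dropped. The function names are hypothetical and not taken from the paper; the outputs are growth rates only, not actual sample counts or regret values, and $T$ is used exactly as it appears in the abstract (its precise definition is given in the paper).

```python
import math


def bpi_sample_complexity_scale(H, S, A, eps, delta):
    """Evaluate (H^3 * S * A / eps^2) * log(1/delta).

    Illustrative only: the constant hidden by Omega(.) is omitted,
    so the result is a scaling term, not a sample count.
    """
    if eps <= 0 or not (0 < delta < 1):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return (H**3 * S * A / eps**2) * math.log(1.0 / delta)


def regret_scale(H, S, A, T):
    """Evaluate sqrt(H^3 * S * A * T), again with the constant omitted."""
    return math.sqrt(H**3 * S * A * T)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    H, S, A = 10, 5, 3
    print(bpi_sample_complexity_scale(H, S, A, eps=0.1, delta=0.05))
    print(regret_scale(H, S, A, T=10_000))
```

Under these expressions, doubling $H$ multiplies the sample-complexity term by 8 and the regret term by $\sqrt{8}$, which is the cubic dependence on the horizon that the abstract highlights for the non-stationary case.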

Publication:

arXiv e-prints

Pub Date:

October 2020

DOI:

10.48550/arXiv.2010.03531

arXiv:

arXiv:2010.03531

Bibcode:

2020arXiv201003531D

Keywords:

Computer Science - Machine Learning; Statistics - Machine Learning
