Instance-dependent $\ell_\infty$-bounds for policy evaluation in tabular reinforcement learning
Abstract
Markov reward processes (MRPs) are used to model stochastic phenomena arising in operations research, control engineering, robotics, and artificial intelligence, as well as communication and transportation networks. In many of these cases, such as in the policy evaluation problem encountered in reinforcement learning, the goal is to estimate the long-term value function of such a process without access to the underlying population transition and reward functions. Working with samples generated under the synchronous model, we study the problem of estimating the value function of an infinite-horizon, discounted MRP on finitely many states in the $\ell_\infty$-norm. We analyze both the standard plug-in approach to this problem and a more robust variant, and establish non-asymptotic bounds that depend on the (unknown) problem instance, as well as data-dependent bounds that can be evaluated based on the observations of state transitions and rewards. We show that these approaches are minimax-optimal up to constant factors over natural subclasses of MRPs. Our analysis makes use of a leave-one-out decoupling argument tailored to the policy evaluation problem, one which may be of independent interest.
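The plug-in approach described in the abstract can be sketched concretely: under the synchronous model, one observes $n$ independent next-state and reward samples from every state, forms the empirical transition matrix $\widehat{P}$ and mean reward vector $\widehat{r}$, and solves the Bellman equation with these empirical quantities in place of the population ones, i.e. $\widehat{V} = (I - \gamma \widehat{P})^{-1} \widehat{r}$. The following is a minimal illustrative sketch, not the paper's code; the function name and input conventions are assumptions.

```python
import numpy as np

def plugin_value_estimate(transitions, rewards, gamma):
    """Plug-in value estimate for a tabular discounted MRP.

    Illustrative sketch (names and conventions are assumptions):
    transitions: int array of shape (n, S); under the synchronous model,
        column s holds n independent next-state samples from state s,
        with entries in {0, ..., S-1}.
    rewards: float array of shape (n, S) of observed rewards per state.
    gamma: discount factor in [0, 1).
    Returns the estimate V_hat solving (I - gamma * P_hat) V = r_hat.
    """
    n, S = transitions.shape
    # Empirical transition matrix: P_hat[s, s'] is the fraction of the
    # n samples from state s that landed in state s'.
    P_hat = np.zeros((S, S))
    for s in range(S):
        counts = np.bincount(transitions[:, s], minlength=S)
        P_hat[s] = counts / n
    # Empirical mean reward per state.
    r_hat = rewards.mean(axis=0)
    # Plug the empirical quantities into the Bellman fixed-point equation.
    return np.linalg.solve(np.eye(S) - gamma * P_hat, r_hat)
```

For example, with two states where every sampled transition leads to state 0 and every reward equals 1, the estimate satisfies $\widehat{V}(0) = 1 + \gamma \widehat{V}(0)$, giving $\widehat{V}(0) = 1/(1-\gamma)$.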
 Publication:

arXiv e-prints
 Pub Date:
 September 2019
 arXiv:
 arXiv:1909.08749
 Bibcode:
 2019arXiv190908749P
 Keywords:

 Statistics - Machine Learning;
 Computer Science - Machine Learning;
 Mathematics - Optimization and Control;
 Mathematics - Probability;
 Mathematics - Statistics Theory
 E-Print:
 Version v2 is consistent with manuscript to appear in IEEE Transactions on Information Theory