An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these methods, our _emphatic TD($\lambda$)_ is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
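The abstract does not state the update rules, so the following is only a minimal sketch of one linear emphatic TD($\lambda$) step, written from the recursions as they are usually given for this algorithm: a followon trace $F_t$, an emphasis $M_t$, an eligibility trace $e_t$, and a single weight vector $\theta$ with one step size $\alpha$. The function name, argument names, and the plain-list representation of vectors are assumptions made for illustration, not the authors' code.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def emphatic_td_step(theta, e, F, phi, phi_next, reward,
                     rho, rho_prev, gamma, gamma_next,
                     lam, interest, alpha):
    """One update of linear emphatic TD(lambda) (illustrative sketch).

    theta      -- weight vector (the only learned parameter vector)
    e, F       -- eligibility trace and followon trace from the previous step
    phi        -- feature vector of the current state
    phi_next   -- feature vector of the next state
    rho        -- importance-sampling ratio for the current action
    rho_prev   -- importance-sampling ratio for the previous action
    gamma      -- discount at the current state; gamma_next at the next state
    lam        -- bootstrapping parameter at the current state
    interest   -- interest in accurately valuing the current state
    alpha      -- the single step-size parameter
    """
    # Followon trace: discounted, importance-weighted accumulation of interest.
    F = rho_prev * gamma * F + interest
    # Emphasis: how strongly this time step's update is weighted.
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace, scaled by the emphasis on the current features.
    e = [rho * (gamma * lam * e_i + M * phi_i) for e_i, phi_i in zip(e, phi)]
    # Ordinary TD error with a state-dependent discount.
    delta = reward + gamma_next * dot(theta, phi_next) - dot(theta, phi)
    # Single weight update with a single step size.
    theta = [th + alpha * delta * e_i for th, e_i in zip(theta, e)]
    return theta, e, F
```

To start an episode under this sketch, pass `F = 0.0`, `rho_prev = 1.0`, and a zero vector for `e`, so that the first followon trace equals the initial interest. Setting `interest` to zero in a state removes that state's direct contribution to the emphasis, which is one way to express the varying degrees of interest mentioned in the abstract.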

Publication: arXiv e-prints

Pub Date: March 2015

DOI: 10.48550/arXiv.1503.04269

arXiv: arXiv:1503.04269

Bibcode: 2015arXiv150304269S

Keywords: Computer Science - Machine Learning

E-Print: 29 pages. This is a significant revision based on the first set of reviews. The most important change was to signal early that the main result is about stability, not convergence.
