Adaptive Trade-Offs in Off-Policy Learning
Abstract
A great variety of off-policy learning algorithms exists in the literature, and new breakthroughs in this area continue to be made, improving theoretical understanding and yielding state-of-the-art reinforcement learning algorithms. In this paper, we take a unifying view of this space of algorithms and consider the trade-offs they make among three fundamental quantities: update variance, fixed-point bias, and contraction rate. This yields new perspectives on existing methods and naturally suggests novel algorithms for off-policy evaluation and control. We develop one such algorithm, C-trace, demonstrating that it makes these trade-offs more efficiently than existing methods in use, and that it can be scaled to achieve state-of-the-art performance in large-scale environments.
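As a rough illustration of the variance/bias/contraction trade-off the abstract refers to, the sketch below implements a generic clipped-importance-weight trace update in the style of Retrace. It is not the paper's C-trace algorithm; the function name, array layout, and the clipping threshold `c_bar` are hypothetical. Raising `c_bar` lowers the bias of the fixed point and speeds contraction, at the cost of higher update variance.

```python
import numpy as np

def trace_return_targets(v, states, next_states, rewards,
                         pi_probs, mu_probs, gamma=0.99, c_bar=1.0):
    """Off-policy return targets for a tabular value function `v` using a
    generic clipped-importance-weight trace (Retrace-style sketch).
    `pi_probs[t]` / `mu_probs[t]` are the target- and behaviour-policy
    probabilities of the action taken at step t.
    """
    states = np.asarray(states)
    next_states = np.asarray(next_states)
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    # Clipped per-step importance weights: a larger c_bar reduces
    # fixed-point bias and speeds contraction, but inflates variance.
    c = np.minimum(c_bar, np.asarray(pi_probs) / np.asarray(mu_probs))
    # One-step TD errors under the current value estimates.
    deltas = rewards + gamma * v[next_states] - v[states]
    # Backwards recursion: A_t = delta_t + gamma * c_{t+1} * A_{t+1},
    # with the tail cut to zero at the end of the trajectory.
    targets = np.empty(T)
    tail = 0.0
    for t in reversed(range(T)):
        trace = c[t + 1] if t + 1 < T else 0.0
        tail = deltas[t] + gamma * trace * tail
        targets[t] = v[states[t]] + tail
    return targets
```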
- Publication: arXiv e-prints
- Pub Date: October 2019
- DOI: 10.48550/arXiv.1910.07478
- arXiv: arXiv:1910.07478
- Bibcode: 2019arXiv191007478R
- Keywords: Computer Science - Machine Learning; Statistics - Machine Learning
- E-Print: AISTATS 2020 camera-ready version