Adaptive Trade-Offs in Off-Policy Learning
Abstract
A great variety of off-policy learning algorithms exists in the literature, and new breakthroughs in this area continue to be made, improving theoretical understanding and yielding state-of-the-art reinforcement learning algorithms. In this paper, we take a unifying view of this space of algorithms and consider the trade-offs they make among three fundamental quantities: update variance, fixed-point bias, and contraction rate. This yields new perspectives on existing methods and naturally suggests novel algorithms for off-policy evaluation and control. We develop one such algorithm, C-trace, demonstrating that it makes these trade-offs more efficiently than existing methods in use, and that it can be scaled to achieve state-of-the-art performance in large-scale environments.
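As a rough illustration of the variance/bias/contraction trade-off the abstract refers to, the sketch below implements a generic clipped-importance-weight trace update in the style of Retrace. It is not the paper's C-trace algorithm; the function name, array layout, and the clipping threshold `c_bar` are hypothetical. Raising `c_bar` lowers the bias of the fixed point and speeds contraction, at the cost of higher update variance.

```python
import numpy as np

def trace_return_targets(v, states, next_states, rewards,
                         pi_probs, mu_probs, gamma=0.99, c_bar=1.0):
    """Off-policy return targets for a tabular value function `v` using a
    generic clipped-importance-weight trace (Retrace-style sketch).
    `pi_probs[t]` / `mu_probs[t]` are the target- and behaviour-policy
    probabilities of the action taken at step t.
    """
    states = np.asarray(states)
    next_states = np.asarray(next_states)
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    # Clipped per-step importance weights: a larger c_bar reduces
    # fixed-point bias and speeds contraction, but inflates variance.
    c = np.minimum(c_bar, np.asarray(pi_probs) / np.asarray(mu_probs))
    # One-step TD errors under the current value estimates.
    deltas = rewards + gamma * v[next_states] - v[states]
    # Backwards recursion: A_t = delta_t + gamma * c_{t+1} * A_{t+1},
    # with the tail cut to zero at the end of the trajectory.
    targets = np.empty(T)
    tail = 0.0
    for t in reversed(range(T)):
        trace = c[t + 1] if t + 1 < T else 0.0
        tail = deltas[t] + gamma * trace * tail
        targets[t] = v[states[t]] + tail
    return targets
```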
- Publication: arXiv e-prints
- Pub Date: October 2019
- DOI: 10.48550/arXiv.1910.07478
- arXiv: arXiv:1910.07478
- Bibcode: 2019arXiv191007478R
- Keywords: Computer Science - Machine Learning; Statistics - Machine Learning
- E-Print: AISTATS 2020 camera-ready version