Two Timescale Stochastic Approximation with Controlled Markov noise and Off-policy temporal difference learning
Abstract
We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by 'controlled' Markov noise. In particular, both the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time-scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, using our results, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation.
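As a reading aid, the kind of coupled iteration the abstract describes can be sketched in a generic form. The notation here is an assumption for illustration and is not taken from the abstract: $\theta_n$ is the slower iterate, $w_n$ the faster iterate, $Z^{(i)}_n$ the controlled Markov noise, $M^{(i)}_{n+1}$ the martingale difference noise, and $a(n)$, $b(n)$ the step sizes. "Non-additive" is read here as the Markov noise entering inside the driving functions $h$ and $g$ rather than as a separate additive term.

```latex
\begin{align}
\theta_{n+1} &= \theta_n + a(n)\left[ h\big(\theta_n, w_n, Z^{(1)}_n\big) + M^{(1)}_{n+1} \right], \\
w_{n+1} &= w_n + b(n)\left[ g\big(\theta_n, w_n, Z^{(2)}_n\big) + M^{(2)}_{n+1} \right],
\end{align}
% Typical two time-scale step-size conditions (assumed, standard in this literature):
\begin{equation}
\sum_n a(n) = \sum_n b(n) = \infty, \qquad
\sum_n \big( a(n)^2 + b(n)^2 \big) < \infty, \qquad
\frac{a(n)}{b(n)} \to 0 .
\end{equation}
```

Under the last condition the $w_n$ recursion moves on the faster time scale and sees $\theta_n$ as nearly constant, which is what allows the two recursions to be analyzed through separate limiting dynamics.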
- Publication:
- arXiv e-prints
- Pub Date:
- March 2015
- DOI:
- 10.48550/arXiv.1503.09105
- arXiv:
- arXiv:1503.09105
- Bibcode:
- 2015arXiv150309105K
- Keywords:
- Mathematics - Dynamical Systems;
- Computer Science - Artificial Intelligence;
- Statistics - Machine Learning
- E-Print:
- 23 pages (relaxed some important assumptions from the previous version); accepted by Mathematics of Operations Research in February 2017