The ODE Method for Asymptotic Statistics in Stochastic Approximation and Reinforcement Learning
Abstract
The paper concerns the $d$-dimensional stochastic approximation recursion, $$ \theta_{n+1}= \theta_n + \alpha_{n + 1} f(\theta_n, \Phi_{n+1}) $$ where $ \{ \Phi_n \}$ is a stochastic process on a general state space, satisfying a conditional Markov property that allows for parameter-dependent noise. The main results are established under additional conditions on the mean flow and a version of the Donsker-Varadhan Lyapunov drift condition known as (DV3): (i) An appropriate Lyapunov function is constructed that implies convergence of the estimates in $L_4$. (ii) A functional central limit theorem (CLT) is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $\textsf{E}[ z_n z_n^T ]$ to the asymptotic covariance in the CLT, where $z_n =: (\theta_n-\theta^*)/\sqrt{\alpha_n}$. (iii) The CLT holds for the normalized version $z^{\text{PR}}_n =: \sqrt{n} [\theta^{\text{PR}}_n -\theta^*]$, of the averaged parameters $\theta^{\text{PR}}_n =:n^{-1} \sum_{k=1}^n\theta_k$, subject to standard assumptions on the step-size. Moreover, the covariance in the CLT coincides with the minimal covariance of Polyak and Ruppert. (iv) An example is given where $f$ and $\bar{f}$ are linear in $\theta$, and $\Phi$ is a geometrically ergodic Markov chain but does not satisfy (DV3). While the algorithm is convergent, the second moment of $\theta_n$ is unbounded and in fact diverges. This arXiv version represents a major extension of the results in prior versions.The main results now allow for parameter-dependent noise, as is often the case in applications to reinforcement learning.
- Publication:
-
arXiv e-prints
- Pub Date:
- October 2021
- DOI:
- arXiv:
- arXiv:2110.14427
- Bibcode:
- 2021arXiv211014427B
- Keywords:
-
- Mathematics - Statistics Theory;
- Computer Science - Machine Learning;
- 62L20;
- 60F17;
- 68T05
- E-Print:
- This arXiv version represents a major extension of the results in prior versions.The main results now allow for parameter-dependent noise, as is often the case in applications to reinforcement learning. 2 figures