Frequentist Regret Bounds for Randomized Least-Squares Value Iteration
Abstract
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are infeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm where exploration is induced by perturbing the least-squares approximation of the action-value function. Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$ where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps. To the best of our knowledge, this is the first frequentist regret analysis for randomized exploration with function approximation.
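The core mechanism the abstract describes, exploration via a perturbed least-squares fit of the action-value function, can be sketched as a single randomized ridge-regression step. This is a minimal illustration, not the paper's full algorithm: the function name, the noise scale `sigma`, and the regularizer `lam` are illustrative choices, and RLSVI additionally iterates this fit backward over the horizon.

```python
import numpy as np

def perturbed_lsq(phi, y, lam=1.0, sigma=1.0, rng=None):
    """One RLSVI-style randomized least-squares step (illustrative sketch).

    phi: (n, d) feature matrix; y: (n,) regression targets
    (e.g. empirical Bellman backups of the action-value function).
    Returns a sample theta ~ N(theta_hat, sigma^2 * (phi^T phi + lam I)^{-1}),
    i.e. the ridge solution perturbed by Gaussian noise, which is what
    induces exploration in place of an explicit optimism bonus.
    """
    rng = rng or np.random.default_rng()
    d = phi.shape[1]
    A = phi.T @ phi + lam * np.eye(d)           # regularized Gram matrix
    theta_hat = np.linalg.solve(A, phi.T @ y)   # ridge least-squares solution
    cov = sigma**2 * np.linalg.inv(A)           # perturbation covariance
    return rng.multivariate_normal(theta_hat, cov)
```

With `sigma=0` the perturbation vanishes and the ordinary ridge estimate is recovered; larger `sigma` yields more aggressive exploration.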
Publication: arXiv e-prints
Pub Date: November 2019
arXiv: arXiv:1911.00567
Bibcode: 2019arXiv191100567Z
Keywords: Computer Science - Machine Learning; Statistics - Machine Learning
E-Print: AISTATS 2020