Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation
Abstract
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps, where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d\cdot\mathrm{sp}(v^*)\sqrt{T})$. The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which are of independent interest.
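For reference, the two quantities appearing in these bounds are usually defined as follows in the average-reward setting. This is a sketch under standard conventions rather than text quoted from the paper: the state space $\mathcal{S}$, the reward function $r$, the optimal average reward $J^*$, and the trajectory $(s_t, a_t)$ are notation assumed here.

$$\mathrm{sp}(v^*) = \sup_{s\in\mathcal{S}} v^*(s) - \inf_{s\in\mathcal{S}} v^*(s), \qquad \mathrm{Regret}(T) = \sum_{t=1}^{T}\bigl(J^* - r(s_t, a_t)\bigr).$$

Under these definitions, a regret bound of order $\sqrt{T}$ means that the average per-step gap to $J^*$ shrinks at rate $1/\sqrt{T}$, up to the stated factors of $d$ and $\mathrm{sp}(v^*)$ and logarithmic terms.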
- Publication:
- arXiv e-prints
- Pub Date:
- September 2024
- DOI:
- arXiv:
- arXiv:2409.10772
- Bibcode:
- 2024arXiv240910772C
- Keywords:
- Computer Science - Machine Learning;
- Computer Science - Data Structures and Algorithms;
- Mathematics - Optimization and Control
- E-Print:
- The main results of this submission were derived based on discussions with the authors of the paper "Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs" (arXiv:2405.15050). We realized that they had derived the same results before us. In response, we retract the submission.