Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

doi:10.48550/arXiv.2405.15050

Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

We study the infinite-horizon average-reward reinforcement learning with linear MDPs. Previous approaches either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity, for achieving a regret bound of $\widetilde{O}(\sqrt{T})$. In this paper, we propose an algorithm that achieves the regret bound of $\widetilde{O}(\sqrt{T})$ and is computationally efficient in the sense that the time complexity is polynomial in problem parameters. Our algorithm runs an optimistic value iteration on a discounted-reward MDP that approximates the average-reward setting. With an appropriately tuned discounting factor $\gamma$, the algorithm attains the desired $\widetilde{O}(\sqrt{T})$ regret. The challenge in our approximation approach is to get a regret bound with a sharp dependency on the effective horizon $1 / (1 - \gamma)$. We address this challenge by clipping the value function obtained at each value iteration step to limit the span of the value function.

Publication:

arXiv e-prints

Pub Date:

May 2024

DOI:

10.48550/arXiv.2405.15050

arXiv:

arXiv:2405.15050

Bibcode:

2024arXiv240515050H

Keywords:

Statistics - Machine Learning;
Computer Science - Machine Learning

E-Print:

Fixes an error in the analysis in the previous version by modifying the algorithm and analysis

NASA/ADS

Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

Abstract