REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs

doi:10.48550/arXiv.1205.2661

We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
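
As a rough illustration of span-based regularization, the sketch below computes the span of a bias vector (its largest entry minus its smallest) and picks, from a finite list of candidates, the one with the best gain after subtracting a span penalty. The abstract does not state the selection rule, so the candidate format, the `penalty` parameter, and the function names are all assumptions made for illustration, not the paper's procedure.

```python
from typing import List, Sequence, Tuple

# Hypothetical candidate format: (label, gain, bias vector).
Candidate = Tuple[str, float, Sequence[float]]


def span(h: Sequence[float]) -> float:
    """Span of a vector: max_s h(s) - min_s h(s)."""
    return max(h) - min(h)


def pick_regularized(candidates: List[Candidate], penalty: float) -> Candidate:
    """Return the candidate maximizing gain - penalty * span(bias).

    Illustrative only: a span-penalized choice over a finite list,
    not the selection rule used in the paper.
    """
    return max(candidates, key=lambda c: c[1] - penalty * span(c[2]))


candidates = [
    ("low_span", 0.50, [0.0, 1.0, 2.0]),    # span 2.0
    ("high_span", 0.60, [0.0, 5.0, 10.0]),  # span 10.0
]

# With a positive penalty the smaller-span candidate wins despite lower gain.
chosen = pick_regularized(candidates, penalty=0.05)
print(chosen[0])
```

With `penalty=0.0` the same call would return the higher-gain candidate, which shows how the penalty trades gain against the span of the bias vector.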

Publication:

arXiv e-prints

Pub Date:

May 2012

DOI:

10.48550/arXiv.1205.2661

arXiv:

arXiv:1205.2661

Bibcode:

2012arXiv1205.2661B

Keywords:

Computer Science - Machine Learning

E-Print:

Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)
