A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games
Abstract
Many model-based reinforcement learning (RL) algorithms can be viewed as having two phases that are iteratively implemented: a learning phase where the model is approximately learned and a planning phase where the learned model is used to derive a policy. In the case of standard MDPs, the learning problem can be solved using either value iteration or policy iteration. However, in the case of zero-sum Markov games, there is no efficient policy iteration algorithm; e.g., it has been shown in Hansen et al. (2013) that one has to solve Ω(1/(1−α)) MDPs, where α is the discount factor, to implement the only known convergent version of policy iteration. Another algorithm for zero-sum Markov games, called naive policy iteration, is easy to implement but is only provably convergent under very restrictive assumptions. Prior attempts to fix the naive policy iteration algorithm have several limitations. Here, we show that a simple variant of naive policy iteration for games converges, and converges exponentially fast. The only addition we propose to naive policy iteration is the use of lookahead in the policy improvement phase. This is appealing because lookahead is anyway often used in RL for games. We further show that lookahead can be implemented efficiently in linear Markov games, which are the counterpart of linear MDPs and have been the subject of much attention recently. We then consider multi-agent reinforcement learning which uses our algorithm in the planning phases, and provide sample and time complexity bounds for such an algorithm.
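The abstract's proposal (replace the one-step greedy improvement in naive policy iteration with an h-step lookahead) can be sketched numerically. The snippet below is a simplified illustration, not the paper's algorithm: it uses a synthetic *turn-based* zero-sum Markov game (one player acts per state), so the minimax Bellman operator reduces to a max or min over actions and no linear program is needed, whereas the paper treats simultaneous-move games. All game data, sizes, and the lookahead depth `h` are made-up parameters for illustration.

```python
import numpy as np

# Synthetic turn-based zero-sum Markov game (illustrative data only).
np.random.seed(0)
S, A, gamma, h = 5, 3, 0.9, 5
P = np.random.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
R = np.random.uniform(-1.0, 1.0, size=(S, A))      # reward to the max player
owner = np.array([0, 1, 0, 1, 0])                  # who acts in each state: 0=max, 1=min

def bellman(V):
    # Minimax Bellman operator; in the turn-based case each state's owner
    # simply maximizes (player 0) or minimizes (player 1) the Q-values.
    Q = R + gamma * (P @ V)
    return np.where(owner == 0, Q.max(axis=1), Q.min(axis=1))

def lookahead_policy(V, h):
    # Policy improvement with h-step lookahead: act greedily with respect to
    # T^{h-1} V instead of V itself -- the one change the abstract proposes.
    for _ in range(h - 1):
        V = bellman(V)
    Q = R + gamma * (P @ V)
    return np.where(owner == 0, Q.argmax(axis=1), Q.argmin(axis=1))

def evaluate(pi):
    # Exact evaluation of the joint policy: V = (I - gamma * P_pi)^{-1} r_pi.
    idx = np.arange(S)
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], R[idx, pi])

# Naive policy iteration with lookahead: improve, evaluate, repeat.
V = np.zeros(S)
for _ in range(30):
    V = evaluate(lookahead_policy(V, h))

# Intuition for why lookahead helps: T^h is a gamma^h sup-norm contraction,
# so the improvement step sees values much closer to the fixed point.
V1, V2 = np.random.randn(S), np.random.randn(S)
T1, T2 = V1.copy(), V2.copy()
for _ in range(h):
    T1, T2 = bellman(T1), bellman(T2)
print(np.max(np.abs(T1 - T2)) <= gamma**h * np.max(np.abs(V1 - V2)))
```

The printed check verifies the γ^h contraction of the h-step operator, which is the standard fact the lookahead step leverages; the paper's actual convergence argument for simultaneous-move games is more involved.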
 Publication:

arXiv e-prints
 Pub Date:
 March 2023
 DOI:
 10.48550/arXiv.2303.09716
 arXiv:
 arXiv:2303.09716
 Bibcode:
 2023arXiv230309716W
 Keywords:

 Computer Science - Machine Learning;
 Computer Science - Artificial Intelligence;
 Computer Science - Computer Science and Game Theory;
 Electrical Engineering and Systems Science - Systems and Control
 E-Print:
 15 pages