A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
Abstract
Model-based algorithms (algorithms that decouple learning of the model and planning given the model) are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm \emph{Optimistic Nash Value Iteration} (Nash-VI) for two-player zero-sum Markov games that is able to output an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the number of actions for the two players respectively, and $H$ is the horizon length. This is the first algorithm that matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor, and compares favorably against the best known model-free algorithm if $\min\{A,B\}=o(H^3)$. In addition, our Nash-VI outputs a single Markov policy with optimality guarantee, while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
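To make the setting concrete, here is a minimal, illustrative sketch of (non-optimistic) Nash value iteration for a finite-horizon two-player zero-sum Markov game with a known model. The paper's Nash-VI additionally runs on an empirically estimated model with optimism bonuses; those ingredients, and all names and toy parameters below, are omissions and assumptions of this sketch, not the paper's algorithm.

```python
def stage_game_value(Q):
    """Minimax value of the matrix game Q (max-player picks a row,
    min-player picks a column). For simplicity this takes the max-min
    over PURE strategies, which equals the Nash value whenever Q has a
    saddle point; the full algorithm solves each stage game for a
    mixed-strategy Nash equilibrium, e.g. via linear programming."""
    return max(min(row) for row in Q)

def nash_value_iteration(H, S, A, B, reward, transition):
    """Backward induction over steps h = H-1, ..., 0.

    reward[h][s][a][b]: immediate reward in [0, 1].
    transition[h][s][a][b]: dict {next_state: probability}.
    Returns V with V[h][s] = Nash value of the game from (h, s).
    """
    V = [[0.0] * S for _ in range(H + 1)]  # V[H][s] = 0 at the horizon
    for h in range(H - 1, -1, -1):
        for s in range(S):
            # Stage-game payoff matrix: reward plus expected future value.
            Q = [[reward[h][s][a][b]
                  + sum(p * V[h + 1][s2]
                        for s2, p in transition[h][s][a][b].items())
                  for b in range(B)] for a in range(A)]
            V[h][s] = stage_game_value(Q)
    return V
```

For instance, a one-step game with payoff matrix [[0.5, 0.2], [0.8, 0.1]] has a saddle point at value 0.2, which the backward induction recovers as V[0][0].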
 Publication:
 arXiv e-prints
 Pub Date:
 October 2020
 arXiv:
 arXiv:2010.01604
 Bibcode:
 2020arXiv201001604L
 Keywords:
 Computer Science - Machine Learning;
 Computer Science - Artificial Intelligence;
 Statistics - Machine Learning