Variance Reduction in Actor Critic Methods (ACM)
Abstract
After presenting Actor Critic Methods (ACM), we show that ACM are control variate estimators. Using the projection theorem, we prove that the Q Actor Critic (QAC) and Advantage Actor Critic (AAC) methods are optimal in the sense of the $L^2$ norm among the control variate estimators spanned by functions conditioned on the current state and action. This straightforward application of the Pythagorean theorem provides a theoretical justification for the strong performance of QAC and AAC, the latter most often referred to as A2C, in deep policy gradient methods. It also enables us to derive a new formulation of Advantage Actor Critic methods that has lower variance and improves on the traditional A2C method.
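To make the control variate idea concrete, here is a minimal sketch on a toy scalar problem. It is not the paper's estimator: the distribution, the target `f`, the control variate `g`, and the function name `control_variate_estimates` are all assumptions chosen for illustration. It shows that the coefficient obtained by $L^2$ projection, $c^* = \mathrm{Cov}(f, g) / \mathrm{Var}(g)$, lowers the variance of a Monte Carlo estimate without changing its mean, which is the same mechanism a baseline exploits in policy gradient estimators.

```python
import random
import statistics


def control_variate_estimates(n_samples=20000, seed=0):
    """Compare plain Monte Carlo with a control variate estimate of E[f(X)].

    Toy setup (assumed for illustration, not taken from the paper):
    X ~ Uniform(0, 1), f(x) = x**2, control variate g(x) = x - 0.5 with E[g] = 0.
    """
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_samples)]
    f = [x * x for x in xs]
    g = [x - 0.5 for x in xs]  # centred, so subtracting c * g keeps the mean

    # L2 projection of f onto g: c* = Cov(f, g) / Var(g).
    mean_f = statistics.fmean(f)
    mean_g = statistics.fmean(g)
    cov_fg = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g)) / (n_samples - 1)
    c_star = cov_fg / statistics.variance(g)

    # Corrected samples: same expectation as f, smaller variance.
    corrected = [a - c_star * b for a, b in zip(f, g)]
    return {
        "plain_mean": mean_f,
        "cv_mean": statistics.fmean(corrected),
        "plain_var": statistics.variance(f),
        "cv_var": statistics.variance(corrected),
        "c_star": c_star,
    }


result = control_variate_estimates()
print(result)
```

In this toy case the exact values are $\mathrm{Var}(f) = 4/45$ and a residual variance of $1/180$ after projection, so the per-sample variance drops by roughly a factor of 16. Note that the sketch estimates $c^*$ from the same samples it corrects, which introduces a small bias that vanishes as the sample size grows.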
- Publication: arXiv e-prints
- Pub Date: July 2019
- DOI:
- arXiv: arXiv:1907.09765
- Bibcode: 2019arXiv190709765B
- Keywords: Computer Science - Machine Learning; Statistics - Machine Learning