Doubly robust off-policy evaluation with shrinkage
Abstract
We propose a new framework for designing estimators for off-policy evaluation in contextual bandits. Our approach is based on the asymptotically optimal doubly robust estimator, but we shrink the importance weights to minimize a bound on the mean squared error, which results in a better bias-variance tradeoff in finite samples. We use this optimization-based framework to obtain three estimators: (a) a weight-clipping estimator, (b) a new weight-shrinkage estimator, and (c) the first shrinkage-based estimator for combinatorial action sets. Extensive experiments in both standard and combinatorial bandit benchmark problems show that our estimators are highly adaptive and typically outperform state-of-the-art methods.
- Publication:
-
arXiv e-prints
- Pub Date:
- July 2019
- DOI:
- 10.48550/arXiv.1907.09623
- arXiv:
- arXiv:1907.09623
- Bibcode:
- 2019arXiv190709623S
- Keywords:
-
- Computer Science - Machine Learning;
- Statistics - Machine Learning
- E-Print:
- International Conference on Machine Learning (2020)