Balanced off-policy evaluation in general action spaces

doi:10.48550/arXiv.1906.03694

Estimating importance sampling weights for off-policy evaluation of contextual bandits often results in imbalance: a mismatch between the desired and the actual distribution of state-action pairs after weighting. In this work we present balanced off-policy evaluation (B-OPE), a generic method for estimating weights that minimize this imbalance. Estimating these weights reduces to a binary classification problem regardless of action type. We show that minimizing the risk of the classifier implies minimizing imbalance relative to the desired counterfactual distribution of state-action pairs. The classifier loss is tied to the error of the off-policy estimate, which allows for easy hyperparameter tuning. We provide experimental evidence that B-OPE improves weighting-based approaches for offline policy evaluation in both discrete and continuous action spaces.
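
The reduction to binary classification can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the policies, the hand-rolled logistic regression, and all names (`featurize`, `fit_logistic`, `weight`) are assumptions made for illustration. Logged (context, action) pairs are labeled 0, the same contexts paired with actions drawn from the target policy are labeled 1, and the classifier's odds p / (1 - p) serve as the importance weights. The self-normalized estimate at the end is one common choice, assumed here rather than taken from the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def featurize(x, a):
    # Hypothetical feature map: intercept, action, context.
    return [1.0, float(a), x]


def fit_logistic(features, labels, steps=300, lr=1.0):
    """Plain gradient descent on the mean logistic loss."""
    coef = [0.0] * len(features[0])
    n = len(features)
    for _ in range(steps):
        grad = [0.0] * len(coef)
        for f, y in zip(features, labels):
            err = sigmoid(dot(coef, f)) - y
            for j, fj in enumerate(f):
                grad[j] += err * fj
        coef = [c - lr * g / n for c, g in zip(coef, grad)]
    return coef


rng = random.Random(0)
n = 1000

# Synthetic logged data (assumption): the logging policy picks a=1 with prob. 0.5.
contexts = [rng.uniform(-1.0, 1.0) for _ in range(n)]
logged_actions = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
rewards = [a + 0.5 * x + rng.gauss(0.0, 0.1)
           for x, a in zip(contexts, logged_actions)]

# Target policy (assumption): picks a=1 with prob. 0.8, so its true value is 0.8.
target_actions = [1 if rng.random() < 0.8 else 0 for _ in range(n)]

# Binary classification problem: label 0 = logged pair, label 1 = target pair.
features = ([featurize(x, a) for x, a in zip(contexts, logged_actions)]
            + [featurize(x, a) for x, a in zip(contexts, target_actions)])
labels = [0] * n + [1] * n
coef = fit_logistic(features, labels)


def weight(x, a):
    # With equal class sizes, the classifier odds estimate the density ratio.
    p = sigmoid(dot(coef, featurize(x, a)))
    return p / (1.0 - p)


weights = [weight(x, a) for x, a in zip(contexts, logged_actions)]

# Self-normalized weighted average of logged rewards vs. the unweighted average.
value_estimate = dot(weights, rewards) / sum(weights)
naive_estimate = sum(rewards) / n
print(f"weighted: {value_estimate:.3f}  unweighted: {naive_estimate:.3f}")
```

Because the classifier only sees (context, action) features, the same construction applies unchanged when the action is continuous; only `featurize` and the sampling of target actions would differ.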

Publication: arXiv e-prints
Pub Date: June 2019
DOI: 10.48550/arXiv.1906.03694
arXiv: 1906.03694
Bibcode: 2019arXiv190603694S
Keywords: Computer Science - Machine Learning; Statistics - Methodology; Statistics - Machine Learning
E-Print: Accepted to AISTATS 2020
