Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling
Abstract
We study off-policy evaluation (OPE) of contextual bandit policies for large discrete action spaces where conventional importance-weighting approaches suffer from excessive variance. To circumvent this variance issue, we propose a new estimator, called OffCEM, that is based on the conjunct effect model (CEM), a novel decomposition of the causal effect into a cluster effect and a residual effect. OffCEM applies importance weighting only to action clusters and addresses the residual causal effect through model-based reward estimation. We show that the proposed estimator is unbiased under a new condition, called local correctness, which only requires that the residual-effect model preserves the relative expected reward differences of the actions within each cluster. To best leverage the CEM and local correctness, we also propose a new two-step procedure for performing model-based estimation that minimizes bias in the first step and variance in the second step. We find that the resulting OffCEM estimator substantially improves bias and variance compared to a range of conventional estimators. Experiments demonstrate that OffCEM provides substantial improvements in OPE especially in the presence of many actions.
- Publication:
-
arXiv e-prints
- Pub Date:
- May 2023
- DOI:
- 10.48550/arXiv.2305.08062
- arXiv:
- arXiv:2305.08062
- Bibcode:
- 2023arXiv230508062S
- Keywords:
-
- Statistics - Machine Learning;
- Computer Science - Artificial Intelligence;
- Computer Science - Machine Learning
- E-Print:
- accepted at ICML2023. arXiv admin note: text overlap with arXiv:2202.06317