Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling

doi:10.48550/arXiv.2305.08062

We study off-policy evaluation (OPE) of contextual bandit policies for large discrete action spaces where conventional importance-weighting approaches suffer from excessive variance. To circumvent this variance issue, we propose a new estimator, called OffCEM, that is based on the conjunct effect model (CEM), a novel decomposition of the causal effect into a cluster effect and a residual effect. OffCEM applies importance weighting only to action clusters and addresses the residual causal effect through model-based reward estimation. We show that the proposed estimator is unbiased under a new condition, called local correctness, which only requires that the residual-effect model preserves the relative expected reward differences of the actions within each cluster. To best leverage the CEM and local correctness, we also propose a new two-step procedure for performing model-based estimation that minimizes bias in the first step and variance in the second step. We find that the resulting OffCEM estimator substantially improves bias and variance compared to a range of conventional estimators. Experiments demonstrate that OffCEM provides substantial improvements in OPE especially in the presence of many actions.
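The estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the estimate is the sample average of a cluster-level importance weight times the residual (observed reward minus model prediction), plus the model's expected reward under the target policy. The function name `offcem_estimate`, its argument layout, and the context-independent cluster map `cluster_of` are all hypothetical choices made for this example. The two-step regression procedure for fitting the reward model is not covered; `f_hat` is taken as given.

```python
def offcem_estimate(pi_e, pi_0, f_hat, cluster_of, actions, rewards):
    """Sketch of a cluster-weighted, model-assisted OPE estimate.

    pi_e, pi_0 : per-sample lists of action probabilities under the
                 target and logging policies (n rows, one entry per action)
    f_hat      : per-sample lists of model-predicted rewards per action
    cluster_of : cluster index of each action (assumed context-independent)
    actions    : logged action index for each sample
    rewards    : observed reward for each sample
    """
    n = len(rewards)
    total = 0.0
    for i in range(n):
        a = actions[i]
        c = cluster_of[a]
        # Actions that share the logged action's cluster.
        members = [b for b in range(len(cluster_of)) if cluster_of[b] == c]
        # Importance weight on the cluster, not on the individual action.
        pe_c = sum(pi_e[i][b] for b in members)
        p0_c = sum(pi_0[i][b] for b in members)
        w = pe_c / p0_c
        # Reward left unexplained by the model for the logged action.
        residual = rewards[i] - f_hat[i][a]
        # Model-based expected reward under the target policy.
        baseline = sum(p * f for p, f in zip(pi_e[i], f_hat[i]))
        total += w * residual + baseline
    return total / n
```

Because the weight is a ratio of cluster probabilities rather than action probabilities, it stays moderate even when an individual action has a very small logging probability, which is the variance argument the abstract makes for large action spaces.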

Publication:

arXiv e-prints

Pub Date:

May 2023

DOI:

10.48550/arXiv.2305.08062

arXiv:

arXiv:2305.08062

Bibcode:

2023arXiv230508062S

Keywords:

Statistics - Machine Learning;
Computer Science - Artificial Intelligence;
Computer Science - Machine Learning

E-Print:

Accepted at ICML 2023. arXiv admin note: text overlap with arXiv:2202.06317
