Inference at the data's edge: Gaussian processes for modeling and inference under model-dependency, poor overlap, and extrapolation
Abstract
The Gaussian Process (GP) is a highly flexible non-linear regression approach that provides a principled approach to handling our uncertainty over predicted (counterfactual) values. It does so by computing a posterior distribution over predicted point as a function of a chosen model space and the observed data, in contrast to conventional approaches that effectively compute uncertainty estimates conditionally on placing full faith in a fitted model. This is especially valuable under conditions of extrapolation or weak overlap, where model dependency poses a severe threat. We first offer an accessible explanation of GPs, and provide an implementation suitable to social science inference problems. In doing so we reduce the number of user-chosen hyperparameters from three to zero. We then illustrate the settings in which GPs can be most valuable: those where conventional approaches have poor properties due to model-dependency/extrapolation in data-sparse regions. Specifically, we apply it to (i) comparisons in which treated and control groups have poor covariate overlap; (ii) interrupted time-series designs, where models are fitted prior to an event by extrapolated after it; and (iii) regression discontinuity, which depends on model estimates taken at or just beyond the edge of their supporting data.
- Publication:
-
arXiv e-prints
- Pub Date:
- July 2024
- DOI:
- arXiv:
- arXiv:2407.10442
- Bibcode:
- 2024arXiv240710442C
- Keywords:
-
- Statistics - Methodology;
- Statistics - Machine Learning
- E-Print:
- Draft manuscript