Double Machine Learning and Automated Confounder Selection -- A Cautionary Tale
Abstract
Double machine learning (DML) has become an increasingly popular tool for automated variable selection in high-dimensional settings. Even though the ability to deal with a large number of potential covariates can render selection-on-observables assumptions more plausible, there is at the same time a growing risk that endogenous variables are included, which would lead to the violation of conditional independence. This paper demonstrates that DML is very sensitive to the inclusion of only a few "bad controls" in the covariate space. The resulting bias varies with the nature of the theoretical causal model, which raises concerns about the feasibility of selecting control variables in a data-driven way.
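The sensitivity to "bad controls" described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's simulation design: it assumes a hypothetical linear data-generating process with one legitimate confounder `x` and one collider `c` (a variable caused by both treatment and outcome), and it uses plain OLS partialling-out without cross-fitting as a stand-in for the machine-learning nuisance estimators that DML would use. All variable names, coefficients, and the helper `partial_out_effect` are illustrative assumptions.

```python
import numpy as np


def partial_out_effect(y, d, X):
    """Residual-on-residual estimate of the effect of d on y.

    Linearly partials the controls X (plus an intercept) out of both
    y and d, then regresses the y-residuals on the d-residuals.
    This is a simplified stand-in for DML's partialling-out step.
    """
    X1 = np.column_stack([np.ones(len(y)), X])
    ry = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    rd = d - X1 @ np.linalg.lstsq(X1, d, rcond=None)[0]
    return float(rd @ ry / (rd @ rd))


rng = np.random.default_rng(0)
n = 20_000
true_effect = 1.0  # assumed for illustration

x = rng.normal(size=n)                                # legitimate confounder
d = 0.8 * x + rng.normal(size=n)                      # treatment
y = true_effect * d + 0.5 * x + rng.normal(size=n)    # outcome
c = d + y + rng.normal(size=n)                        # collider ("bad control")

# Controlling only for the confounder recovers the effect.
theta_good = partial_out_effect(y, d, x.reshape(-1, 1))

# Adding the collider to the control set biases the estimate.
theta_bad = partial_out_effect(y, d, np.column_stack([x, c]))

print(f"controls = [x]:    {theta_good:.3f}")
print(f"controls = [x, c]: {theta_bad:.3f}")
```

In this particular design, the first estimate should land near the assumed effect of 1.0, while the second is pulled toward zero once the collider enters the control set. Other causal structures would bias the estimate differently, which is the abstract's point that the bias depends on the underlying causal model.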
- Publication: arXiv e-prints
- Pub Date: August 2021
- DOI: 10.48550/arXiv.2108.11294
- arXiv: arXiv:2108.11294
- Bibcode: 2021arXiv210811294H
- Keywords: Economics - Econometrics
- E-Print: v4: published version