Impossibility theorems for feature attribution
Abstract
Machine learning models can learn complex patterns from data, but it is often difficult to understand why they make particular predictions. To tackle this problem, practitioners typically turn to feature attribution methods, which seek to attribute the model's behavior f(x) around an example x to particular features, or dimensions of x, that are most important for the prediction. In recent years, a new class of feature attribution methods, namely complete and linear methods, has become popular. Our work shows that, unfortunately, such methods can be misleading: complete and linear methods are provably less reliable than simpler methods at answering basic feature attribution questions. We provide impossibility results that highlight their failure cases and discuss how we might instead obtain reliable feature attributions.
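As an illustration of the two properties named in the abstract, the following minimal sketch computes exact baseline Shapley values for a toy model and checks them numerically. It assumes the usual definitions, which the abstract itself does not spell out: completeness means the attributions sum to f(x) - f(baseline), and linearity means the attribution of f + g equals the sum of the attributions of f and g. The functions `f` and `g`, the input `x`, the all-zeros baseline, and the helper `shapley_attributions` are hypothetical choices made for this example and are not taken from the paper.

```python
from itertools import permutations
from math import factorial


def shapley_attributions(model, x, baseline):
    """Exact baseline Shapley values: average each feature's marginal
    contribution over every ordering in which features are switched
    from the baseline value to the value in x."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        z = list(baseline)
        prev = model(z)
        for i in order:
            z[i] = x[i]
            cur = model(z)
            phi[i] += cur - prev
            prev = cur
    return [p / factorial(n) for p in phi]


# Hypothetical toy models, chosen only for illustration.
def f(z):
    return z[0] * z[1] + 2.0 * z[0]


def g(z):
    return z[1] ** 2


def f_plus_g(z):
    return f(z) + g(z)


x = [1.0, 3.0]
baseline = [0.0, 0.0]  # assumed reference point

phi_f = shapley_attributions(f, x, baseline)
phi_g = shapley_attributions(g, x, baseline)
phi_sum = shapley_attributions(f_plus_g, x, baseline)

# Completeness: attributions add up to the change in model output.
completeness_gap = sum(phi_f) - (f(x) - f(baseline))

# Linearity: attributing f + g matches attributing f and g separately.
linearity_gap = max(abs(s - (a + b)) for s, a, b in zip(phi_sum, phi_f, phi_g))

print("phi_f:", phi_f)
print("completeness gap:", completeness_gap)
print("linearity gap:", linearity_gap)
```

Both gaps come out as zero for this toy model. The sketch only demonstrates what the two properties mean; it does not reproduce the paper's impossibility results.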
- Publication:
- Proceedings of the National Academy of Sciences
- Pub Date:
- January 2024
- DOI:
- 10.1073/pnas.2304406120
- arXiv:
- arXiv:2212.11870
- Bibcode:
- 2024PNAS..12104406B
- Keywords:
- Computer Science - Machine Learning;
- Computer Science - Artificial Intelligence
- E-Print:
- 38 pages, 4 figures. Updated for PNAS publication