Impossibility theorems for feature attribution

doi:10.1073/pnas.2304406120

Machine learning models can learn complex patterns from data, but it is often difficult to understand why they make particular predictions. To tackle this problem, practitioners typically turn to feature attribution methods, which seek to attribute the model's behavior f(x) around an example x to particular features, or dimensions of x, that are most important for the prediction. In recent years, a new class of feature attribution methods, namely complete and linear methods, has become popular. Our work shows that, unfortunately, such methods can be misleading: Complete and linear methods are provably less reliable than simpler methods at answering basic feature attribution questions. We provide impossibility results that highlight their failure cases and discuss how we might instead obtain reliable feature attributions.

Publication:

Proceedings of the National Academy of Sciences

Pub Date:

January 2024

DOI:

10.1073/pnas.2304406120

arXiv:

arXiv:2212.11870

Bibcode:

2024PNAS..12104406B

Keywords:

Computer Science - Machine Learning;
Computer Science - Artificial Intelligence

E-Print:

38 pages, 4 figures. Updated for PNAS publication
