How should we proxy for race/ethnicity? Comparing Bayesian improved surname geocoding to machine learning methods
Abstract
Bayesian Improved Surname Geocoding (BISG) is the most popular method for proxying race/ethnicity in voter registration files that do not contain it. This paper benchmarks BISG against a range of previously untested machine learning alternatives, using voter files with self-reported race/ethnicity from California, Florida, North Carolina, and Georgia. This analysis yields three key findings. First, machine learning consistently outperforms BISG at individual classification of race/ethnicity. Second, BISG and machine learning methods exhibit divergent biases for estimating regional racial composition. Third, the performance of all methods varies substantially across states. These results suggest that pre-trained machine learning models are preferable to BISG for individual classification. Furthermore, mixed results across states underscore the need for researchers to empirically validate their chosen race/ethnicity proxy in their populations of interest.
- Publication:
-
arXiv e-prints
- Pub Date:
- June 2022
- DOI:
- 10.48550/arXiv.2206.14583
- arXiv:
- arXiv:2206.14583
- Bibcode:
- 2022arXiv220614583D
- Keywords:
-
- Computer Science - Machine Learning