Machine Learning Models to Predict the Occurrence of Arsenic in Domestic Wells throughout the Conterminous United States
Abstract
Arsenic from geologic sources is widespread in groundwater in the United States (U.S.), and concentrations can exceed the U.S. Environmental Protection Agency maximum contaminant level (MCL) of 10 µg/L established for public drinking-water supplies. New Jersey and New Hampshire have adopted a lower MCL of 5 µg/L for arsenic in public drinking water. Unlike public drinking-water supplies, domestic water supplies are largely unregulated and generally do not require testing for arsenic. As part of a collaborative project with epidemiologists and public health scientists from multiple agencies and academic institutions, the U.S. Geological Survey is developing statistical models using boosted regression trees (BRT) and random forest classification (RFC) techniques to characterize the occurrence of arsenic in domestic water-supply wells throughout the conterminous U.S. Groundwater arsenic data from approximately 20,000 domestic wells in the U.S. were used to train and test the models. The results from these models will be used to estimate arsenic exposure probabilities in subsequent epidemiological studies.
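As a minimal sketch of the data preparation implied above, the example below labels well records by whether they exceed the two MCL values cited in this abstract and sets aside a hold-out subset before any model would be fitted. The well records, the 80/20 split, and the names `exceeds` and `split_holdout` are illustrative assumptions, not the U.S. Geological Survey workflow, and no BRT or RFC model is fitted here.

```python
import random

FEDERAL_MCL_UG_L = 10.0  # U.S. EPA MCL for arsenic, in µg/L
STATE_MCL_UG_L = 5.0     # lower MCL adopted by New Jersey and New Hampshire, in µg/L


def exceeds(concentration_ug_l, threshold_ug_l):
    """Return 1 if the concentration is strictly above the threshold, else 0."""
    return 1 if concentration_ug_l > threshold_ug_l else 0


def split_holdout(records, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, holdout) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical well records: (well_id, arsenic concentration in µg/L).
wells = [("W1", 0.4), ("W2", 3.2), ("W3", 7.5), ("W4", 12.0), ("W5", 10.0)]

# Attach one binary exceedance label per threshold to each record.
labeled = [
    {
        "well_id": well_id,
        "arsenic_ug_l": conc,
        "above_federal_mcl": exceeds(conc, FEDERAL_MCL_UG_L),
        "above_state_mcl": exceeds(conc, STATE_MCL_UG_L),
    }
    for well_id, conc in wells
]

# Hold out 20% of the records (an assumed fraction) for later accuracy checks.
train, holdout = split_holdout(labeled, holdout_fraction=0.2, seed=0)
```

Binary labels of this kind are what a threshold-exceedance classifier would be trained on, and the hold-out subset is reserved for checking prediction accuracy on data the model has not seen.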
Three BRT models predict the probability that arsenic concentrations in domestic water-supply wells exceed 1 µg/L, 5 µg/L, and 10 µg/L. The RFC model predicts the most probable arsenic concentration category for an area. The arsenic concentration categories correspond to the threshold values in the BRT models and are ≤1 µg/L, >1 to ≤5 µg/L, >5 to ≤10 µg/L, and >10 µg/L. Predictor variables in the models include soil geochemistry, bedrock lithology, average annual precipitation and groundwater recharge, other hydrologic variables, and surrogate variables for processes associated with arsenic mobility and solubility in groundwater. Preliminary results indicate that the predictor variables are generally consistent among the models; the shared predictors include average annual precipitation, relative position of groundwater along a flow path, and soil geochemistry. Overall prediction accuracy on data held out from model development (i.e., hold-out data) is also generally comparable (~90%) among the models. These exposure models are intended to be paired with health-outcome datasets, such as cancer incidence and mortality and birth weights, and with biomonitoring datasets.
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2019
- Bibcode:
- 2019AGUFMGH23B1246L
- Keywords:
- 0240 Public health (GEOHEALTH);
- 1831 Groundwater quality (HYDROLOGY);
- 1847 Modeling (HYDROLOGY);
- 1880 Water management (HYDROLOGY)