A hybrid machine learning model to estimate nitrate contamination of production zone groundwater in the Central Valley, California
Abstract
A hybrid, non-linear, machine learning statistical model was developed within a statistical learning framework to predict nitrate contamination of groundwater to depths of approximately 500 m below ground surface in the Central Valley, California. A database of 213 predictor variables representing well characteristics, historical and current field and county scale nitrogen mass balance, historical and current landuse, oxidation/reduction conditions, groundwater flow, climate, soil characteristics, depth to groundwater, and groundwater age were assigned to over 6,000 private supply and public supply wells measured previously for nitrate and located throughout the study area. The machine learning method, gradient boosting machine (GBM) was used to screen predictor variables and rank them in order of importance in relation to the groundwater nitrate measurements. The top five most important predictor variables included oxidation/reduction characteristics, historical field scale nitrogen mass balance, climate, and depth to 60 year old water. Twenty-two variables were selected for the final model and final model errors for log-transformed hold-out data were R squared of 0.45 and root mean square error (RMSE) of 1.124. Modeled mean groundwater age was tested separately for error improvement in the model and when included decreased model RMSE by 0.5% compared to the same model without age and by 0.20% compared to the model with all 213 variables. 1D and 2D partial plots were examined to determine how variables behave individually and interact in the model. Some variables behaved as expected: log nitrate decreased with increasing probability of anoxic conditions and depth to 60 year old water, generally decreased with increasing natural landuse surrounding wells and increasing mean groundwater age, generally increased with increased minimum depth to high water table and with increased base flow index value. Other variables exhibited much more erratic or noisy behavior in the model making them more difficult to interpret but highlighting the usefulness of the non-linear machine learning method. 2D interaction plots show probability of anoxic groundwater conditions largely control estimated nitrate concentrations compared to the other predictors.
- Publication:
-
AGU Fall Meeting Abstracts
- Pub Date:
- December 2016
- Bibcode:
- 2016AGUFM.H33F1617R
- Keywords:
-
- 1831 Groundwater quality;
- HYDROLOGYDE: 1871 Surface water quality;
- HYDROLOGYDE: 1879 Watershed;
- HYDROLOGYDE: 1895 Instruments and techniques: monitoring;
- HYDROLOGY