Assessing a plethora of machine learning techniques for estimating PM2.5 using satellite based AOD and meteorological information
Abstract
Particulates with a diameter less than 2.5 micrometers (PM2.5) are known to have adverse effects on public health. PM2.5 can be measured at ground monitoring stations, but these are often geographically sparse and thus cannot faithfully document long-term exposure to PM2.5 nationwide. Historically, numerical modeling has filled in the data gap. More recently, a zoo of machine learning (ML) methods has been proposed to estimate surface PM2.5 concentration from satellite observations of the aerosol optical depth (AOD), ground-based measurements of meteorological factors, and other miscellaneous data sources such as proximity to major roads and power plants. However, which explanatory variables and ML methods are best suited for this purpose has not been systematically examined. Here, we estimate daily PM2.5 concentration from meteorological and AOD data in NASA's MERRA-2 reanalysis, spanning 2010-2020 with a spatial resolution of 0.5° × 0.625°. From a pool of 22 candidate variables (wind speed, wind direction, humidity, etc.), a subset of 10 variables is automatically and objectively selected based on the Mutual Information criterion. The algorithms critically assessed include linear models (Linear, Elastic Net, Polynomial Regression, Bayesian Ridge, Support Vector Regression), four different Multi-layer Perceptrons, and decision-tree ensembles (Random Forest, Extra Trees, Ada Boost, Gradient Boost, and XGBoost). Finally, the three models with the lowest cross-validated root mean squared error (RMSE) are combined via stacking, which creates a more accurate estimator than the conventional approach of selecting and adopting any single model. Additionally, the importance of including spatial features in the prediction of PM2.5 concentration is investigated with a Linear Mixed Effect model and Geographically Weighted Regression. Models trained separately on each grid cell are also compared to a single model trained over all grid cells.
Texas serves as a testbed in this pilot study, but the automated package developed here can be easily adapted to other regions and other data sources. Through open collaborative efforts with the international community, we aim to deliver this satellite data-driven product broadly to assess the health risk due to long-term exposure to PM2.5 beyond the urban environment.
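The selection-and-stacking workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' package: the synthetic data, the reduced set of candidate models, the hyperparameters, the `RidgeCV` meta-learner, and the helper name `cv_rmse` are all assumptions made for the example.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import BayesianRidge, ElasticNet, LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the reanalysis data (assumption): 22 candidate
# predictors, of which only the first three actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 22))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + X[:, 2] ** 2 + 0.1 * rng.normal(size=300)

# Step 1: keep the 10 variables with the highest estimated mutual information.
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func=score_func, k=10)
X_sel = selector.fit_transform(X, y)

# Step 2: score each candidate model by 5-fold cross-validated RMSE.
candidates = {
    "linear": LinearRegression(),
    "elastic_net": ElasticNet(alpha=0.01),
    "bayesian_ridge": BayesianRidge(),
    "random_forest": RandomForestRegressor(n_estimators=30, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=30, random_state=0),
    "gradient_boost": GradientBoostingRegressor(n_estimators=50, random_state=0),
}


def cv_rmse(model, features, target):
    """Return the mean 5-fold cross-validated RMSE (hypothetical helper)."""
    scores = cross_val_score(
        model, features, target, cv=5, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()


rmse = {name: cv_rmse(model, X_sel, y) for name, model in candidates.items()}

# Step 3: stack the three models with the lowest cross-validated RMSE.
top3 = sorted(rmse, key=rmse.get)[:3]
stack = StackingRegressor(
    estimators=[(name, candidates[name]) for name in top3],
    final_estimator=RidgeCV(),  # meta-learner choice is an assumption
    cv=5,
)
stack_rmse = cv_rmse(stack, X_sel, y)

print("Top three models:", top3)
print("Stacked CV RMSE:", round(stack_rmse, 3))
```

For brevity the sketch selects variables once on the full data set. In a real evaluation the selection step would normally sit inside a `Pipeline` so that it is refit within each cross-validation fold and does not leak information from the held-out data.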
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2021
- Bibcode: 2021AGUFM.A51C..05S