A Novel Machine Learning Method for Surface PM2.5 Estimations from Geostationary Satellites

Particulate matter (PM) with a diameter of less than or equal to 2.5 µm, known as PM2.5, affects human health because it penetrates the respiratory system. The Environmental Protection Agency (EPA) measures the atmospheric concentration of PM2.5 using air quality monitors stationed throughout the Continental United States (CONUS). Such measurements are points in a spatial domain and therefore might not be representative of air quality in nearby areas, since the composition of the atmosphere is highly variable from place to place. Satellite-based aerosol optical depth (AOD) provides a spatially uniform means of estimating PM2.5, and new geostationary satellites provide AOD estimates at high temporal and spatial resolution. However, the concentration of PM2.5 is non-linearly dependent on other atmospheric parameters, including relative humidity, temperature, and the height of the planetary boundary layer. This information may be estimated at spatial and temporal resolutions similar to those of AOD from numerical models such as the National Oceanic and Atmospheric Administration's (NOAA) High-Resolution Rapid Refresh (HRRR) model, which resolves near-real-time atmospheric conditions over the CONUS. The estimation of PM2.5 concentration is a multi-parametric problem that must account for temporal dependencies among the different parameters. Deep learning approaches are appropriate for such complex estimation problems because they intrinsically capture relations among multiple non-linear parameters. This study compares deep learning methods to traditional regression analysis to demonstrate the capabilities of these methods in predicting PM2.5 concentrations. Additionally, a novel ensemble learning approach is employed to identify scientific processes that could further improve the estimation of PM2.5 concentration.
Long Short-Term Memory (LSTM) neural networks are suitable for multivariate time series estimation problems because they are capable of learning long-term dependencies. Using LSTM networks, individual models are created for each EPA station and trained on the aforementioned dataset collocated over each station. Individual station models are merged when merging improves performance, that is, when it reduces the root mean squared error (RMSE). This ensemble training method ultimately reduces the RMSE value. Evaluation of these results provides insights into physical processes and related observable parameters that may contribute to PM2.5 concentrations. Parameters identified as statistically different between the merged and unmerged models are expected to improve overall performance. These new parameters are then used to reevaluate the deep learning methods together with an extreme gradient boosting model; the extreme gradient boosting model provides the best results, with an RMSE of 5.5.
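
The merge-if-RMSE-improves idea described above can be illustrated with a minimal sketch. This is not the study's implementation: the abstract does not specify how candidate merges are chosen or scored, so the pairwise greedy search and the pooled validation RMSE criterion used here are assumptions. A one-variable least-squares line stands in for each station's LSTM so that the example stays self-contained, and the station names and data are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Stand-in for a trained station model (NOT an LSTM): least squares y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx

def pooled(stations, names, split):
    """Concatenate the 'train' or 'val' samples of every station in `names`."""
    xs, ys = [], []
    for name in names:
        x, y = stations[name][split]
        xs.extend(x)
        ys.extend(y)
    return xs, ys

def validation_sse(stations, names):
    """Fit one model on pooled training data; return (squared error sum, count) on validation."""
    a, b = fit_line(*pooled(stations, names, "train"))
    xv, yv = pooled(stations, names, "val")
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xv, yv)), len(yv)

def merge_stations(stations):
    """Greedily merge station groups while a merge lowers the pooled validation RMSE.

    Assumed criterion: compare one model trained on both groups against
    the two separate models, scored on the same validation samples.
    """
    groups = [(name,) for name in sorted(stations)]
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                sse_i, n_i = validation_sse(stations, groups[i])
                sse_j, n_j = validation_sse(stations, groups[j])
                separate = math.sqrt((sse_i + sse_j) / (n_i + n_j))
                sse_m, n_m = validation_sse(stations, groups[i] + groups[j])
                together = math.sqrt(sse_m / n_m)
                if together < separate:
                    groups[i] = groups[i] + groups[j]
                    del groups[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return groups

# Hypothetical data: stations A and B follow the same relationship with
# opposite noise in their training samples; station C follows a different one.
stations = {
    "A": {"train": ([0, 1, 2, 3], [1.5, 3.0, 5.0, 6.5]), "val": ([4, 5], [9.0, 11.0])},
    "B": {"train": ([0, 1, 2, 3], [0.5, 3.0, 5.0, 7.5]), "val": ([4, 5], [9.0, 11.0])},
    "C": {"train": ([0, 1, 2, 3], [10.0, 7.0, 4.0, 1.0]), "val": ([4, 5], [-2.0, -5.0])},
}
print(merge_stations(stations))
```

In this toy setup, pooling helps only stations that share the same underlying relationship, which mirrors the abstract's suggestion that comparing merged and unmerged models may point to physical differences between stations.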

Publication: AGU Fall Meeting Abstracts
Pub Date: December 2021
Bibcode: 2021AGUFM.A25J1812P
