Simplify and Accelerate Earth Science Data Preparation to Systemize Machine Learning
Abstract
Data preparation is the most laborious and time-consuming part of machine learning, and the effort required usually grows faster than linearly with the number of data varieties involved. From a system science viewpoint, useful machine learning in Earth Science is likely to involve diverse datasets, so simplifying data preparation to ease the systemization of machine learning in Earth Science is of immense value. The technologies we have developed and applied to an array database, SciDB, are designed explicitly for this purpose. They include the innovative SpatioTemporal Adaptive-Resolution Encoding (STARE), a remapping tool suite, and an efficient implementation of connected component labeling (CCL). STARE serves as a universal Earth data representation that homogenizes data varieties and facilitates spatiotemporal data placement and alignment, maximizing query performance on massively parallel, distributed computing resources for a major class of analyses. Moreover, it converts spatiotemporal set operations into fast and efficient integer interval operations, which in turn supports moving-object analysis. Integrative analysis requires more than overlapping spatiotemporal sets: for example, a meaningful comparison of temperature fields obtained by different means and at different resolutions requires transforming them to the same grid. Remapping has therefore been implemented to enable integrative analysis. Finally, Earth Science investigations are generally studies of phenomena, e.g. tropical cyclones, atmospheric rivers, and blizzards, through their associated events, such as Hurricanes Katrina and Sandy. Unfortunately, except for a few high-impact phenomena, comprehensive episodic records are lacking. Consequently, we have implemented an efficient CCL tracking algorithm, enabling event-based investigations within climate data records beyond mere event presence.
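To illustrate the idea of event tracking by connected component labeling, the following is a minimal sketch, not the efficient SciDB implementation the abstract refers to. It assumes a thresholded boolean field indexed as (time, row, column) and assumes that cells belong to the same event when they share a face in space or time; the function name `label_events` and the toy data are hypothetical.

```python
from collections import deque

def label_events(mask):
    """Label connected components in a (time, row, col) boolean field.

    Cells are connected when they are face neighbors in space or in time
    (an assumed connectivity rule). Returns (labels, count), where labels
    has the same shape as mask and 0 means "no event".
    """
    nt, ny, nx = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * nx for _ in range(ny)] for _ in range(nt)]
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                 (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    count = 0
    for t in range(nt):
        for y in range(ny):
            for x in range(nx):
                if not mask[t][y][x] or labels[t][y][x]:
                    continue
                # Start a new event and flood-fill it breadth-first.
                count += 1
                labels[t][y][x] = count
                queue = deque([(t, y, x)])
                while queue:
                    ct, cy, cx = queue.popleft()
                    for dt, dy, dx in neighbors:
                        t2, y2, x2 = ct + dt, cy + dy, cx + dx
                        if (0 <= t2 < nt and 0 <= y2 < ny and 0 <= x2 < nx
                                and mask[t2][y2][x2]
                                and not labels[t2][y2][x2]):
                            labels[t2][y2][x2] = count
                            queue.append((t2, y2, x2))
    return labels, count

# Toy field: two time steps on a 3 x 4 grid (1 = threshold exceeded).
field = [
    [[1, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]],
    [[0, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 1, 0]],
]
labels, count = label_events(field)
print(count)  # number of distinct events found
```

Because time is treated as just another axis, a feature that persists from one time step to the next receives a single label, which is what turns per-snapshot detection into an episodic record.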
In summary, we have implemented the core unifying capabilities on a Big Data technology to enable systematic machine learning in Earth Science.
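The conversion of spatiotemporal set operations into integer interval operations mentioned in the abstract can be illustrated with a small sketch. This is not the STARE encoding itself: it only assumes that each region has already been reduced to a sorted list of disjoint, closed integer index intervals, and the interval values and the function name `intersect_intervals` are hypothetical.

```python
def intersect_intervals(a, b):
    """Intersect two sorted lists of disjoint closed integer intervals.

    Each interval is a (low, high) tuple. A two-pointer sweep visits every
    interval once, so the cost is linear in len(a) + len(b).
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if low <= high:
            result.append((low, high))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical index ranges covered by two datasets.
region_a = [(0, 15), (32, 47), (64, 79)]
region_b = [(8, 39), (70, 100)]
print(intersect_intervals(region_a, region_b))
# [(8, 15), (32, 39), (70, 79)]
```

Under these assumptions, finding where two datasets overlap needs only integer comparisons, with no geometric computation.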
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2017
- Bibcode: 2017AGUFMIN11E..07K
- Keywords: 1914 Data mining (INFORMATICS); 1916 Data and information discovery (INFORMATICS); 1942 Machine learning (INFORMATICS); 1968 Scientific reasoning/inference (INFORMATICS)