EarthInsights: Parallel Clustering of Large Earth Science Datasets on the Summit Supercomputer
Abstract
Observational and modeled data acquired or generated by the various disciplines within the realm of the Earth sciences encompass temporal scales of seconds to millions of years and spatial scales of microns to tens of thousands of kilometers. Because of rapid technological advances in sensor development, computational capacity, and data storage density, the volume, complexity, and resolution of Earth science data are increasing quite rapidly. Unsupervised classification is a widely applied data mining approach to derive insights from such data; we utilized such techniques to identify ecoregions and similar phenological regions (phenoregions). However, classification of very large data sets is a complex computational problem that requires efficient numerical algorithms and implementations on high performance computing (HPC) platforms.
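As a rough illustration of the kind of unsupervised classification discussed here, the sketch below runs k-means (Lloyd's algorithm) on data split into chunks, so that each chunk contributes per-cluster partial sums that are then combined. This mirrors, in spirit only, how a distributed implementation might reduce partial results across workers; in an MPI setting the combining loop would typically correspond to an all-reduce. It is not the implementation described in this abstract. All names (`nearest`, `partial_sums`, `kmeans_chunked`, `n_workers`, `init`) are hypothetical, and the chunks are processed serially.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_d:
            best, best_d = j, d
    return best


def partial_sums(chunk, centroids):
    """Work of one hypothetical worker: per-cluster coordinate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        j = nearest(point, centroids)
        counts[j] += 1
        for a in range(dim):
            sums[j][a] += point[a]
    return sums, counts


def kmeans_chunked(points, k, n_workers=4, max_iter=100, init=None, seed=0):
    """k-means where each iteration combines partial sums from data chunks."""
    if init is None:
        # Assumption: random initialisation from the data points themselves.
        init = random.Random(seed).sample(points, k)
    centroids = [list(c) for c in init]
    dim = len(points[0])
    # Round-robin split of the data; a real code would distribute these chunks.
    chunks = [points[i::n_workers] for i in range(n_workers)]
    for _ in range(max_iter):
        total = [[0.0] * dim for _ in range(k)]
        count = [0] * k
        # Reduction step: accumulate the partial results of every chunk.
        for chunk in chunks:
            s, c = partial_sums(chunk, centroids)
            for j in range(k):
                count[j] += c[j]
                for a in range(dim):
                    total[j][a] += s[j][a]
        # Empty clusters keep their previous centroid.
        new = [
            [t / count[j] for t in total[j]] if count[j] else centroids[j]
            for j in range(k)
        ]
        if new == centroids:  # assignments stopped changing
            break
        centroids = new
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Toy usage: two well-separated groups of 2-D points, fixed starting centroids.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans_chunked(
    points, k=2, n_workers=3, init=[(0.0, 0.0), (10.0, 10.0)]
)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because only per-cluster sums and counts cross chunk boundaries, the data exchanged per iteration scales with the number of clusters and variables rather than with the number of observations, which is one reason this formulation is a common starting point for parallel k-means.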
The computational platform targeted for this research is Summit, the leadership-class supercomputer at Oak Ridge National Laboratory, which is presently the fastest supercomputer in the world. Summit consists of 4,608 compute servers, each containing two 22-core IBM Power9 processors (CPUs) and six NVIDIA Tesla V100 graphics processing unit accelerators (GPUs), connected by a high-speed 100 Gb/s InfiniBand interconnect and providing a peak performance of 200,000 trillion calculations per second, or 200 petaflops. Specifically, we designed and optimized a parallel clustering technique based on k-means cluster analysis that can efficiently utilize heterogeneous supercomputers like Summit. We developed a hybrid programming model implementation using MPI, CUDA, and OpenACC that can utilize both CPU and GPU resources on computational nodes. Furthermore, we will present details on suitable load balancing strategies, including resource allocation configurations in hybrid mode. Using our implementation on Summit, we were able to process large Earth science datasets that were hitherto infeasible to analyze on Titan, a previous-generation supercomputer. When applied across a time series of geospatial data, multivariate geospatio-temporal clustering has proven quite useful for characterizing the time-evolving dynamical behavior of Earth system processes.
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2018
- Bibcode: 2018AGUFMIN33A..04S
- Keywords: 1855 Remote sensing (HYDROLOGY); 1908 Cyberinfrastructure (INFORMATICS); 1914 Data mining (INFORMATICS); 1942 Machine learning (INFORMATICS)