Mission in a minute: Complex Spatial Query and Data Access in the Cloud for the ICESat-2 Mission
Abstract
How would you track elevation and slope changes occurring at an ice divide (drainage basin boundary) in Antarctica? With ICESat-2, finding the data would involve hundreds or thousands of orbital segments across multiple orbital cycles, which means subsetting thousands of files totaling hundreds of gigabytes to access a narrow buffer around the line of a complex polygon. Building on foundational work in the astronomy community, we have adapted the HEALPix hierarchical global data grid to define a spatial partition of ATL06 data on publicly accessible cloud object stores, with indexing extensions for fast multi-resolution subsetting and loading of the dataset using modern open-source Python libraries. While we use the Vaex Python dataframe library to load and process ICESat-2 data for our ice divide science use case, our partitioning scheme embraces a FAIR design by using the Apache Hive metadata standard coupled with self-describing Apache Parquet files. This allows the community to adapt our query and subsetting routines for use with other tools, both open source and commercial. Because HEALPix indexing and querying are based on subdivision of the embedding space (and not of the data points themselves), the cost and complexity of spatial queries such as polygon, radius, and neighborhood search depend only on the desired spatial resolution and not on the amount of available data. As a result, loading a subset of data is limited only by the bandwidth of parallel network access to the data (in our case, the Amazon S3 and EC2 architectures), enabling the query, download, and loading into memory of selected data encompassing tens of millions of observations in under a minute within the Amazon Web Services cloud.
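The multi-resolution partitioning idea can be illustrated with a minimal sketch. It relies on one property of the HEALPix NESTED numbering scheme: each cell splits into four children per order, so moving between resolutions is a bit shift on the cell index. Everything else here is an assumption made for illustration and is not taken from the abstract: the function names, the `order=`/`cell=` partition keys, and the `s3://example-bucket/atl06` root are hypothetical, and the directory layout only follows the general Hive `key=value` convention.

```python
from typing import Iterable, List


def parent_cell(cell: int, from_order: int, to_order: int) -> int:
    """Coarsen a NESTED-scheme HEALPix cell index to a lower order."""
    if to_order > from_order:
        raise ValueError("to_order must not exceed from_order")
    # Each order adds two bits to the nested index (four children per cell).
    return cell >> (2 * (from_order - to_order))


def child_range(cell: int, from_order: int, to_order: int) -> range:
    """Return the contiguous index range of descendants at a higher order."""
    if to_order < from_order:
        raise ValueError("to_order must not be below from_order")
    shift = 2 * (to_order - from_order)
    return range(cell << shift, (cell + 1) << shift)


def partition_paths(
    cells: Iterable[int],
    cell_order: int,
    partition_order: int,
    root: str = "s3://example-bucket/atl06",  # hypothetical location
) -> List[str]:
    """Map query cells to Hive-style partition prefixes (key=value dirs)."""
    # Deduplicate: many fine cells usually fall in the same coarse partition.
    parents = sorted({parent_cell(c, cell_order, partition_order) for c in cells})
    # Hypothetical partition keys; a real dataset may name them differently.
    return [f"{root}/order={partition_order}/cell={p}/" for p in parents]


if __name__ == "__main__":
    # Fine cells at order 2 covering a query region, stored at order 1.
    for path in partition_paths([24, 25, 27, 40], cell_order=2, partition_order=1):
        print(path)
```

The number of prefixes returned depends on how many cells the query region covers at the chosen order, not on how many observations those partitions hold, which is the scaling property the abstract describes. In practice the list of query cells would come from a HEALPix library's polygon or disc query rather than being written by hand.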
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2021
- Bibcode:
- 2021AGUFM.C55B0594G