GeoCrate: Modern Data Containers For Geophysical Data in the Cloud
Abstract
UNAVCO Geodetic Data Services and IRIS Data Services are collaborating on the migration of two independent, self-managed data systems into a single cloud-based platform. As part of this migration and redesign work, we are implementing modern data containers for a range of geophysical data types supported by our facilities. The primary goals of this effort are to support efficient data center operations and, importantly, to enable direct user access for efficient processing of archived data sets collected from multiple geophysical sensor types. Initial work has focused on developing several prototype data containers using TileDB's embedded storage engine. Early testing has demonstrated significantly improved performance for some of our more complex and important use cases. Our objective is to leverage many of the attributes of TileDB arrays in our designs, including multi-dimensional slicing, compression filters, parallel I/O, cloud storage backends, support for sparse arrays, and versioning. The new geophysical containers will enable a more diverse set of data queries across internal repositories, which in turn will improve our data discovery capabilities. In addition, researchers using data-proximate, horizontally scalable cloud resources will be able to take advantage of parallel access to our very large datasets, significantly reducing Extract, Transform and Load (ETL) timescales. TileDB also satisfies another important objective: it provides interfaces to the programming languages (Python, R, C/C++, Java, Golang, C#) and frameworks (Xarray, Dask, Pandas) commonly used by the research communities we support. We will report on the status of our development, describe the challenges we have encountered, and show examples using our prototypes.
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2022
- Bibcode: 2022AGUFMIN42D0350B