Using TileDB and Pangeo to Provide Access to Thousands of NetCDF Files as Analysis-Ready Data
Abstract
In small numbers, NetCDF files are usable as analysis-ready data objects. But when the number of NetCDF files to analyse reaches the thousands and the analysis needs to be performed in the cloud, traditional NetCDF is no longer the best format. One way around this is to provide the contents of all the NetCDF files in a cloud-optimised, analysis-ready data format. One such format is TileDB, a cloud-optimised universal data engine that provides storage of labelled data (both gridded and sparse) as well as metadata, fast array IO via sharding and parallel data access, and APIs in multiple languages.
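To make the idea of sharding concrete, the following is a minimal sketch of how a read request against a regularly tiled array maps onto a small set of tiles, each of which could then be fetched independently and in parallel. It is an illustration only and not TileDB's actual implementation; the function name `tiles_for_slice`, the (time, lat, lon) layout, and the tile extents are all assumptions made for the example.

```python
from itertools import product


def tiles_for_slice(starts, stops, tile_extents):
    """Return the index tuples of the regular tiles touched by a read.

    The read is a half-open slab [start, stop) along each dimension.
    Hypothetical helper for illustration; not part of any TileDB API.
    """
    ranges = []
    for start, stop, extent in zip(starts, stops, tile_extents):
        if stop <= start:
            return []  # empty request touches no tiles
        first = start // extent        # tile holding the first requested cell
        last = (stop - 1) // extent    # tile holding the last requested cell
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# Assumed layout: a (time, lat, lon) array stored in 100 x 90 x 180 tiles.
touched = tiles_for_slice(
    starts=(150, 0, 0),
    stops=(250, 90, 360),
    tile_extents=(100, 90, 180),
)
print(touched)
```

Only the tiles returned need to be read from storage, which is why a small spatial or temporal subset of a very large array can be retrieved without touching the rest of the data.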
In this presentation we'll explore the process of taking thousands of NetCDF files and converting them to a single, representative, usable, logical TileDB array. We'll look at the problems and challenges of doing so, particularly those relating to maintaining consistent NetCDF metadata throughout the conversion process. We'll take an in-depth look at the analysis-ready TileDB dataset produced from the NetCDF data and demonstrate running domain-specific analyses on the TileDB array in the Pangeo cloud platform, using the Python libraries Iris and Xarray. And we'll look at a real-world use case for such a conversion, which took 150 years of NetCDF reanalysis data and converted it to TileDB for sharing with project collaborators.
The context for this work is that the UK Met Office produces multiple terabytes of data per day as output from weather forecast model runs on Met Office supercomputers. On top of this, data from weather and climate research, historical and archive sources, and billions of observation measurements mean that the Met Office has a significant amount of data that could be put to work in the cloud. The challenge for the Met Office is that the majority of the data is stored as thousands of NetCDF files, a form that is not well suited to storing, processing, and extracting value from data in the cloud.
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2020
- Bibcode:
- 2020AGUFMIN0420004K
- Keywords:
- 0439 Ecosystems, structure and dynamics (BIOGEOSCIENCES);
- 1920 Emerging informatics technologies (INFORMATICS);
- 1932 High-performance computing (INFORMATICS);
- 1980 Spatial analysis and representation (INFORMATICS)