Cloud futures for CMIP data - evaluating object storage models and re-evaluating federation for data distribution
Abstract
We explore the impact of cloud computing on access to and analysis of CMIP data in the context of two infrastructures: the Earth System Grid Federation (ESGF), a large, globally distributed data infrastructure, and JASMIN, a national facility for the UK environmental sciences community.
Over the ten years since the original development of ESGF, there has been a paradigm shift in the way data is accessed and analysed. Regional clusters have been established in which CMIP data is aggregated and co-located with processing capability to enable in-situ analysis. In the UK, the consequences can be seen in the development of JASMIN; more widely, initiatives such as the ESA Exploitation Platforms and Pangeo have exploited public cloud hosting to provide shared resources for data access and analysis. With the use of public cloud, traditional models of data storage and access are also being challenged by the adoption of object storage as a convenient, competitive alternative to POSIX storage. Irrespective of the storage interface, the overall cost of hosting large volumes of scientific data on public cloud remains high compared with on-premise hosting. These trends have implications for future storage and access modes for both ESGF and JASMIN. For JASMIN, a policy of actively integrating cloud technologies into its on-premise solution has been pursued to provide technical agility: because it is a cloud-aware platform, applications can be more readily migrated between JASMIN and other clouds on the basis of policy and cost considerations rather than technological limitations. Further, as providers of large-volume climate datasets, we can offer a hosting environment that is amenable to cloud-native applications. To this end, pilot work is underway to evaluate different models for access and storage of CMIP6 data on JASMIN's own object store. We report on testing of two different systems for storage and access: Xarray and Zarr, and an in-house developed application, S3-netcdf-python.
We also consider functional aspects, such as the challenge of balancing new models that reorganise data into a form optimised for user analysis against considerations such as versioning and data integrity, which are critical for long-term preservation and archiving.
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2020
- Bibcode: 2020AGUFMIN032..02K
- Keywords: 3360 Remote sensing; ATMOSPHERIC PROCESSES; 1626 Global climate models; GLOBAL CHANGE; 1920 Emerging informatics technologies; INFORMATICS; 1932 High-performance computing; INFORMATICS