CMIP6 in the Cloud - Why?
Abstract
Aha! You have an awe-inspiring insight and can't wait to share your results. Then an advisor, colleague, or reviewer asks, "But what do the CMIPx models say?" In 2008, with the 35 TB of CMIP3 data, you could perhaps come up with an answer in a few days: collecting the needed data for all available models, putting them on uniform time and space grids, checking units, and running your analysis. In 2020, the many petabytes of CMIP6 data are globally distributed, which makes downloading much more difficult. If you are fortunate enough to have access to a local copy of the needed CMIP6 data (e.g., on Cheyenne), someone has already done most of this work for you and your analysis is straightforward. Unfortunately, Cheyenne and similar server-side analysis platforms collect only a subset of CMIP6. Moreover, if you do not have quick access to server-side analysis, then you (and many others) must go through the long, tedious download process.
Here at LDEO, we started collecting the new CMIP6 data for our own server-side analysis platform in early 2019. As with prior CMIPx, we pooled our resources to support a full-time data scientist to collect and pre-process the data as needed. At the same time, the Pangeo community was inspiring many of us with its vision of promoting open, reproducible, and scalable science. The Pangeo principles of 'taking the computations to the data' and 'reducing toil' also resonated with us. As a proof-of-concept activity, we started converting our own 'dark repository' of CMIP6 data into the new (parallel-I/O-friendly) Zarr format and uploading it to Google Cloud. Thanks to hosting support from Google Cloud, this effort became the CMIP6 Google Public Dataset, accessible to anyone with an internet connection (no registration, passwords, or cryptocards needed) using common tools (web browser, Python, etc.) together with the standard Pangeo collection of open-source Python packages. We believe that the existence of cloud-based CMIP6 data has the potential to seriously accelerate climate science. We outline our attempts to fully automate the data transfer, and we refer those interested in the many details (e.g., the handling of errata, support of data handles, and licensing) to our GitHub repo. Instead, we will present aspects that might be of more general interest to the community: sustainability of the effort, lessons learned, and how you can help.
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2020
- Bibcode:
- 2020AGUFMIN0310005H
- Keywords:
- 3360 Remote sensing;
- ATMOSPHERIC PROCESSES;
- 1626 Global climate models;
- GLOBAL CHANGE;
- 1920 Emerging informatics technologies;
- INFORMATICS;
- 1932 High-performance computing;
- INFORMATICS