Pangeo-Forge: Crowdsourcing Analysis-Ready, Cloud Optimized Data
Abstract
Downloading, cleaning, and organizing data is a huge drain on the productivity of geoscientists. As datasets continue to grow in volume and complexity, this burden represents a significant obstacle to scientific progress. Analysis-ready, cloud-optimized (ARCO) data has the potential to transform geoscience research by eliminating the tedious, error-prone steps of preparing data for analysis. Coupled with cloud computing, ARCO data lets scientists quickly process large, complex datasets, shortening the time to new insights and enhancing collaboration and reproducibility.
However, someone has to make the ARCO data. We currently have an abundance of great cloud-optimized data formats and computing tools, but we don't have much ARCO data. Why not? Primary data providers, such as NASA and NOAA, focus mainly on providing archival-quality data, not ARCO data. So the job falls to individual scientists, and relatively few people have the specialized skills and experience to produce useful ARCO datasets. Today, the effort to produce new ARCO data scales linearly with the number of datasets to process, and we don't have enough people to do it at a large scale. Moreover, it's not clear who should bear the cost of producing and hosting ARCO data. To solve these challenges, we are developing Pangeo Forge, a platform to automate the production of ARCO data. Pangeo Forge allows users to provide a simple, code-based "recipe" for how to extract data from a primary data repository, homogenize, clean, and transform it into an analysis-ready format, and deposit the resulting ARCO dataset into a cloud-based repository. The backend system then automatically executes the recipe and keeps the ARCO data up to date. This approach offers a scalable way to produce large volumes of useful ARCO data by crowdsourcing recipes from the geoscience community. Furthermore, it provides a way to track provenance throughout the data lifecycle, enhancing trust in ARCO data repositories. With support from NSF's EarthCube program, we are launching a new ARCO data repository aimed at supporting research in data-intensive fields such as high-resolution climate modeling. This presentation will describe some technical aspects of how Pangeo Forge works and explain how the community can contribute to this project.
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2020
- Bibcode:
- 2020AGUFMIN0420005A
- Keywords:
- 0439 Ecosystems, structure and dynamics (BIOGEOSCIENCES);
- 1920 Emerging informatics technologies (INFORMATICS);
- 1932 High-performance computing (INFORMATICS);
- 1980 Spatial analysis and representation (INFORMATICS)
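
As an illustration of the recipe idea described in the abstract, the following is a minimal Python sketch of an extract, transform, and deposit pipeline. It is not the Pangeo Forge API: `Recipe`, `extract_from_archive`, `to_celsius`, and `target_store` are hypothetical names chosen for this example, the two sample records are made up, and an in-memory list stands in for the cloud-based repository.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, float]


@dataclass
class Recipe:
    """Hypothetical stand-in for a recipe: three plain callables."""

    extract: Callable[[], Iterable[Record]]   # read from the primary repository
    transform: Callable[[Record], Record]     # homogenize / clean one record
    load: Callable[[List[Record]], None]      # deposit into the target store

    def run(self) -> int:
        """Execute the three stages in order; return how many records were written."""
        cleaned = [self.transform(record) for record in self.extract()]
        self.load(cleaned)
        return len(cleaned)


def extract_from_archive() -> Iterable[Record]:
    # Assumption: placeholder for fetching files from a primary data archive.
    yield {"time": 0.0, "temp_K": 273.15}
    yield {"time": 1.0, "temp_K": 274.15}


def to_celsius(record: Record) -> Record:
    # Example cleaning step: convert kelvin to degrees Celsius and rename the field.
    return {"time": record["time"], "temp_C": round(record["temp_K"] - 273.15, 2)}


# Assumption: a list plays the role of the cloud-based ARCO repository.
target_store: List[Record] = []

recipe = Recipe(
    extract=extract_from_archive,
    transform=to_celsius,
    load=target_store.extend,
)
n_written = recipe.run()
```

Because the recipe is only a description of the three stages, a backend could in principle re-run `recipe.run()` whenever the source archive changes, which is the sense in which the abstract speaks of keeping ARCO data up to date.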