Modernizing Earth and Space Science Modeling Workflows in the Big Data Era
Abstract
Modeling is a major aspect of the Earth and space science research. The development of numerical models of the Earth system, planetary systems or astrophysical systems is essential to linking theory with observations. Optimal use of observations that are quite expensive to obtain and maintain typically requires data assimilation that involves numerical models. In the Earth sciences, models of the physical climate system are typically used for data assimilation, climate projection, and inter-disciplinary research, spanning applications from analysis of multi-sensor data sets to decision-making in climate-sensitive sectors with applications to ecosystems, hazards, and various biogeochemical processes. In space physics, most models are from first principles, require considerable expertise to run and are frequently modified significantly for each case study. The volume and variety of model output data from modeling Earth and space systems are rapidly increasing and have reached a scale where human interaction with data is prohibitively inefficient. A major barrier to progress is that modeling workflows isn't deemed by practitioners to be a design problem. Existing workflows have been created by a slow accretion of software, typically based on undocumented, inflexible scripts haphazardly modified by a succession of scientists and students not trained in modern software engineering methods. As a result, existing modeling workflows suffer from an inability to onboard new datasets into models; an inability to keep pace with accelerating data production rates; and irreproducibility, among other problems. These factors are creating an untenable situation for those conducting and supporting Earth system and space science. Improving modeling workflows requires investments in hardware, software and human resources. This paper describes the critical path issues that must be targeted to accelerate modeling workflows, including script modularization, parallelization, and automation in the near term, and longer term investments in virtualized environments for improved scalability, tolerance for lossy data compression, novel data-centric memory and storage technologies, and tools for peer reviewing, preserving and sharing workflows, as well as fundamental statistical and machine learning algorithms.
- Publication:
-
AGU Fall Meeting Abstracts
- Pub Date:
- December 2017
- Bibcode:
- 2017AGUFMIN31A0067K
- Keywords:
-
- 1916 Data and information discovery;
- INFORMATICS;
- 1920 Emerging informatics technologies;
- INFORMATICS;
- 1932 High-performance computing;
- INFORMATICS;
- 1998 Workflow;
- INFORMATICS