Scaling to diversity: The DERECHOS distributed infrastructure for analyzing and sharing data
Abstract
Integrating Earth Science data from diverse sources such as satellite imagery and simulation output can be expensive and time-consuming, limiting scientific inquiry and the quality of our analyses. Reducing these costs will improve innovation and quality in science. The current Earth Science data infrastructure focuses on downloading data based on requests formed from the search and analysis of associated metadata. And while the data products provided by archives may use the best available data sharing technologies, scientist end-users generally do not have such resources (including staff) available to them. Furthermore, only once an end-user has received the data from multiple diverse sources and has integrated them can the actual analysis and synthesis begin. The cost of getting from idea to where synthesis can start dramatically slows progress. In this presentation we discuss a distributed computational and data storage framework that eliminates much of the aforementioned cost. The SciDB distributed array database is central as it is optimized for scientific computing involving very large arrays, performing better than less specialized frameworks like Spark. Adding spatiotemporal functions to the SciDB creates a powerful platform for analyzing and integrating massive, distributed datasets. SciDB allows Big Earth Data analysis to be performed "in place" without the need for expensive downloads and end-user resources. Spatiotemporal indexing technologies such as the hierarchical triangular mesh enable the compute and storage affinity needed to efficiently perform co-located and conditional analyses minimizing data transfers. These technologies automate the integration of diverse data sources using the framework, a critical step beyond current metadata search and analysis. Instead of downloading data into their idiosyncratic local environments, end-users can generate and share data products integrated from diverse multiple sources using a common shared environment, turning distributed active archive centers (DAACs) from warehouses into distributed active analysis centers.
- Publication:
-
AGU Fall Meeting Abstracts
- Pub Date:
- December 2016
- Bibcode:
- 2016AGUFMIN51C1869R
- Keywords:
-
- 3360 Remote sensing;
- ATMOSPHERIC PROCESSESDE: 1910 Data assimilation;
- integration and fusion;
- INFORMATICSDE: 1926 Geospatial;
- INFORMATICSDE: 1996 Web Services;
- INFORMATICS