Enabling Scalable Integration of Diverse Data
Abstract
Scaling from single-investigator datasets to community data resources is a difficult challenge. In a single-investigator dataset, it is possible to curate the data and work with custom formats by hand. When the same kind of data is collected by large numbers of investigators contributing to a community dataset, standards and automation are required. In our work with the U.S. Department of Energy's (DOE) AmeriFlux, FLUXNET, NGEE-Tropics, Watershed Function SFA, and ESS-DIVE projects, we need to combine data from tens to thousands of field investigators. For example, AmeriFlux is a network of more than three hundred PI-managed sites measuring ecosystem CO2, water, and energy fluxes in North, Central, and South America, with over seventeen hundred users of the combined dataset. NGEE-Tropics and Watershed Function SFA are both large, multi-institution teams investigating diverse ecosystems with extensive measurement campaigns, contributed data, and measurement infrastructure. This presentation will describe our experience using a mix of community outreach, user experience research techniques, automation, and a data integration broker to enable development of scalable, integrated community resources. These community resources provide tools that allow the user to view widely diverse data as a single integrated data resource.
To accomplish this goal, we work closely with the data collectors to standardize the collected data and the collection process. The standardized data then allow development of semi-automated data quality checking, curation, and data product generation. Data brokering capabilities have allowed us to integrate the data so that large numbers of users can search and retrieve diverse data through a unified view. Our next scaling challenge is to enable these same capabilities at a data archive. ESS-DIVE is a new DOE-funded data archive serving data from projects in DOE's Environmental Systems Science programs. These projects investigate ecosystems spanning the subsurface to the atmospheric interface. In this presentation we will also describe how we plan to provide an integrated community data resource capability on the data from the hundreds to thousands of projects that contribute to the ESS-DIVE data archive.
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2018
- Bibcode:
- 2018AGUFMIN34B..01A
- Keywords:
- 1912 Data management, preservation, rescue; INFORMATICS
- 1916 Data and information discovery; INFORMATICS
- 1930 Data and information governance; INFORMATICS
- 1942 Machine learning; INFORMATICS