Getting Double the Work Done with Half the Effort: Provenance and Metadata with Semantic Workflows
Abstract
The variety, velocity, and volume of big data are dwarfing our ability to analyze it using the computational tools and models at our disposal. Studies report that researchers spend more than 60% of their time preparing data for model input or data-model inter-comparison before they can even establish a baseline in a given science project. Computational workflow systems can assist with these tasks by automating the execution of complex computations. When metadata is available, semantic workflow systems can use it to make intelligent decisions based on the type of data and the requirements of the models. This talk will discuss the importance of provenance-aware software that both generates and uses metadata as the data is being processed, and the new capabilities this enables for researchers. This combined system was used to develop and test a near-real-time scientific workflow that facilitates observation of the spatio-temporal distribution of whole-stream metabolism estimates using available monitoring station flow and water quality data. Two challenges arise: (1) the data integration steps combine data from public government repositories and local sensors, each with different associated properties (data integrity, sampling intervals, units), and (2) the variability of the interim flows requires adaptive model selection within the framework of the metabolism calculations. These challenges are addressed by using a data integration system in which metadata and provenance are generated as the data is prepared and then subsequently used by a semantic workflow system to automatically select and configure models, effectively customizing the analysis to the daily data. Data preparation involves the extraction, cleaning, normalization, and integration of the data coming from sensors and third-party data sources.
In this process, the captured metadata and provenance include sensor specifications, data types, data properties, and process documentation. They are passed along with the data to the workflow system, which automates the generation of whole-stream metabolism estimates in near real time by selecting the best model for each daily dataset. The entire process is captured so that it is easily repeatable and can be published as provenance and metadata for the resulting data.
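The idea of metadata generated during data preparation and then used for model selection can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `DailyDataset` structure, the `units` metadata key, the model names `low_flow_model` and `high_flow_model`, and the 1.0 m3/s threshold are hypothetical and do not come from the system described in the abstract.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Tuple

CFS_TO_CMS = 0.0283168  # cubic feet per second -> cubic metres per second


@dataclass
class DailyDataset:
    """One day of flow readings plus the metadata and provenance recorded so far."""
    values: List[float]
    metadata: Dict[str, str]
    provenance: List[str] = field(default_factory=list)


def normalize_flow_units(ds: DailyDataset) -> DailyDataset:
    """Convert ft3/s readings to m3/s and record the step as provenance."""
    if ds.metadata.get("units") != "ft3/s":
        return ds  # nothing to convert
    return DailyDataset(
        values=[v * CFS_TO_CMS for v in ds.values],
        metadata={**ds.metadata, "units": "m3/s"},
        provenance=ds.provenance + ["normalize_flow_units: ft3/s -> m3/s"],
    )


def select_model(ds: DailyDataset, low_flow_limit: float = 1.0) -> str:
    """Pick a (hypothetical) model name from the metadata and the day's mean flow."""
    if ds.metadata.get("units") != "m3/s":
        raise ValueError("flow must be normalized to m3/s before model selection")
    if not ds.values:
        raise ValueError("dataset has no readings")
    mean_flow = sum(ds.values) / len(ds.values)
    return "low_flow_model" if mean_flow < low_flow_limit else "high_flow_model"


def run_daily(ds: DailyDataset) -> Tuple[str, DailyDataset]:
    """Prepare the data, select a model, and return both with updated provenance."""
    prepared = normalize_flow_units(ds)
    model = select_model(prepared)
    record = replace(
        prepared,
        provenance=prepared.provenance + [f"select_model: chose {model}"],
    )
    return model, record
```

Calling `run_daily` on a dataset tagged with `{"units": "ft3/s"}` returns the chosen model name together with a record whose provenance lists both the unit conversion and the selection step, which is the kind of trace that could be published alongside the results.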
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2012
- Bibcode: 2012AGUFMIN11B1465G
- Keywords: 1936 INFORMATICS / Interoperability; 1948 INFORMATICS / Metadata: Provenance; 1970 INFORMATICS / Semantic web and semantic integration; 1998 INFORMATICS / Workflow