Facilitating Scientific Research through Workflows and Provenance on the DataONE Cyberinfrastructure (Invited)
Abstract
Provenance data has numerous applications in science. Two key ones are (1) replication, which facilitates the repeatable derivation of results, and (2) discovery, which enables locating data based on processing history and derivation relationships. The following scenario illustrates a typical use of provenance data. Alice, a climate scientist, has developed a VisTrails workflow to prepare Gross Primary Productivity (GPP) data. After verifying that the workflow generates data in the desired form, she uses the ReproZip tool to create a reproducible package that will enable other scientists to re-run the workflow without having to install and configure the particular libraries she is using. In addition, she exports the provenance information of the workflow execution and customizes it with a tool such as ProvExplorer to eliminate the information she regards as superfluous. She then creates and shares a DataONE data package containing the data she prepared, the ReproZip package, the customized provenance, and additional science/system metadata. Both the customized provenance and the metadata are indexed by the DataONE Cyberinfrastructure (CI) for discovery purposes. Bob, another climate scientist, is looking for benchmark GPP data to validate the Terrestrial Biosphere Model (TBM) he has developed. Searching the DataONE repository, he finds Alice's data package. He retrieves its ReproZip package, customizes it (e.g., by changing the spatial resolution), and re-runs it to generate the benchmark data in the form he desires. The newly generated data is then used as input for his own model evaluation workflow. His workflow generates residual maps and a Taylor diagram that enable him to evaluate the similarity between the results of his model and the benchmark data. At this point, Bob can also use the same tools Alice used to publish his results as another discoverable and reproducible data package.
To support these capabilities, we propose to extend the DataONE CI to be 'provenance-aware'. DataONE Member Nodes manage packages containing the data along with their science/system metadata, indexing them to facilitate search by the tools of the Information Toolkit (ITK) such as ONEMercury. Our extension is based on PBase, which can manage D-PROV provenance data efficiently. D-PROV extends the W3C PROV standard by enabling the explicit representation of workflows along with traces in a standardized manner. PBase consists of components for storage, indexing, and querying. First, it has a provenance data repository capable of meeting the scalability and availability requirements expected for large datasets. A graph database such as Neo4j could fulfill this role. Second, PBase has a specialized index that enables efficient data access; it builds on and extends the indexing capabilities of the underlying database. Finally, PBase contains a querying engine supported by the provenance index. This engine will enable easy and efficient access to the provenance data, primarily through a declarative language.
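The kind of derivation query that such a provenance store is meant to answer can be illustrated with a minimal sketch. It is not PBase or D-PROV: the class `ProvenanceStore`, its methods, and the entity and activity names are hypothetical, and an in-memory dictionary stands in for the graph database. It borrows only the two basic W3C PROV relations, `used` and `wasGeneratedBy`, and applies them to the data products named in the scenario above.

```python
from collections import defaultdict


class ProvenanceStore:
    """Toy in-memory stand-in for a provenance repository (hypothetical)."""

    def __init__(self):
        # entity -> activity that produced it (PROV wasGeneratedBy)
        self.generated_by = {}
        # activity -> entities it consumed (PROV used)
        self.used = defaultdict(set)

    def record(self, activity, inputs, outputs):
        """Record one workflow step with its input and output entities."""
        self.used[activity].update(inputs)
        for entity in outputs:
            self.generated_by[entity] = activity

    def upstream(self, entity):
        """Return all entities that `entity` was transitively derived from."""
        seen = set()
        stack = [entity]
        while stack:
            current = stack.pop()
            activity = self.generated_by.get(current)
            if activity is None:
                continue  # a source entity with no recorded producer
            for source in self.used.get(activity, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# Assumed names loosely following the Alice and Bob scenario.
store = ProvenanceStore()
store.record("prepare_gpp", inputs=["raw_gpp"], outputs=["benchmark_gpp"])
store.record(
    "evaluate_tbm",
    inputs=["benchmark_gpp", "tbm_output"],
    outputs=["residual_maps", "taylor_diagram"],
)
lineage = store.upstream("taylor_diagram")
print(sorted(lineage))
```

A production system would presumably delegate the traversal to the underlying graph database and its index rather than walk dictionaries in application code; the sketch only shows the shape of a lineage query.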
Publication: AGU Fall Meeting Abstracts
Pub Date: December 2013
Bibcode: 2013AGUFMIN53E..07L
Keywords: 1908 INFORMATICS Cyberinfrastructure; 1948 INFORMATICS Metadata: Provenance; 1998 INFORMATICS Workflow; 1976 INFORMATICS Software tools and services