Provenance in Data Interoperability for Multi-Sensor Intercomparison
Abstract
As our inventory of Earth science data sets grows, the ability to compare, merge and fuse multiple data sets grows in importance. This implies a need for deeper data interoperability than we have now. Many efforts (e.g. OPeNDAP, Open Geospatial Consortium) have broken down format barriers to interoperability; the next challenge is the semantic aspects of the data. Consider the issues when satellite data are merged, cross-calibrated, validated, inter-compared and fused. We must determine how to match up data sets that are related, yet different in significant ways: the exact nature of the phenomenon being measured, measurement technique, exact location in space-time, or the quality of the measurements. If subtle distinctions between similar measurements are not clear to the user, the results can be meaningless or even lead to an incorrect interpretation of the data. Most of these distinctions trace back to how the data came to be: sensors, processing, and quality assessment. For example, monthly averages of satellite-based aerosol measurements often show significant discrepancies, which might be due to differences in spatio-temporal aggregation, sampling issues, sensor biases, algorithm differences and/or calibration issues. This provenance information must therefore be captured in a semantic framework that allows sophisticated data inter-use tools to incorporate it, and eventually aid in the interpretation of comparison or merged products. Semantic web technology allows us to encode our knowledge of measurement characteristics, phenomena measured, space-time representations, and data quality representation in a well-structured, machine-readable ontology and rulesets. An analysis tool can use this knowledge to show users the provenance-related distinctions between two variables, advising on options for further data processing and analysis. An additional problem for workflows distributed across heterogeneous systems is retrieval and transport of provenance.
Provenance information may be either embedded within the data payload or transmitted from server to client out of band, whether through a dedicated provenance protocol or as an addition to existing protocol standards. The out-of-band mechanism is more flexible in the richness of provenance information that it can accommodate, but it relies on a persistent framework. Also, if the user saves the data locally for preservation purposes, a data management problem arises in keeping the provenance information with the data. Therefore, we are prototyping the embedded model, incorporating the provenance within metadata objects in the data payload. Thus, it always remains with the data, no matter where the user moves them. The downside is a limit on the size of provenance metadata that we can include, an issue that will eventually need resolution to encompass the richness of provenance information required for data intercomparison and merging.
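As a rough illustration of the embedded model, the following Python sketch stores provenance as a metadata entry inside the data payload and reports the provenance fields on which two variables differ. It is not the prototype described in the abstract: the payload layout, the function names, the field names, and the 4096-byte cap are all assumptions made for the example.

```python
import json

# Hypothetical cap on embedded provenance size. The abstract only says
# that a limit exists; this number is an assumption for the example.
MAX_PROVENANCE_BYTES = 4096


def embed_provenance(payload, provenance, max_bytes=MAX_PROVENANCE_BYTES):
    """Return a copy of payload with provenance stored in its metadata."""
    encoded = json.dumps(provenance, sort_keys=True)
    size = len(encoded.encode("utf-8"))
    if size > max_bytes:
        # The embedded model trades richness for portability.
        raise ValueError(f"provenance is {size} bytes; limit is {max_bytes}")
    metadata = dict(payload.get("metadata", {}))
    metadata["provenance"] = encoded
    return {**payload, "metadata": metadata}


def extract_provenance(payload):
    """Recover the provenance record that travels with the payload."""
    return json.loads(payload["metadata"]["provenance"])


def provenance_differences(first, second):
    """Map each differing provenance field to its pair of values."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }


# Illustrative records only; sensor names and values are placeholders.
var_a = embed_provenance(
    {"variable": "aerosol_optical_depth", "values": [0.1, 0.2]},
    {"sensor": "sensor_A", "aggregation": "mean of daily means"},
)
var_b = embed_provenance(
    {"variable": "aerosol_optical_depth", "values": [0.3, 0.4]},
    {"sensor": "sensor_B", "aggregation": "mean of daily means"},
)
differences = provenance_differences(
    extract_provenance(var_a), extract_provenance(var_b)
)
```

Because the provenance is serialized into the payload's own metadata, copying or saving the payload carries the provenance along with it, and the size check makes the stated downside of the embedded model explicit.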
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2008
- Bibcode: 2008AGUFMIN11C1041L
- Keywords: 0520 Data analysis: algorithms and implementation; 0525 Data management