Provenance Tracking for Earth Science Data and Its Metadata Representation
Abstract
Earth science data production often involves long, complex chains of processes that accept files as input and create new files. These chains form a mathematical graph in which the files and processes are the vertices and the relations between files and processes are the edges. There are four types of edges, which can be expressed by the relations "this file was produced by that process," "this file is used by those processes," "this process needs these files for input," and "this process produces those files as output." Because Earth science data production often uses previous data for statistical quality control, provenance graphs can be very large. For example, if previous data are used to develop statistics of clear sky radiances, a particular file may depend on statistics collected over many months of data. For EOS data or for the upcoming NPP and NPOESS missions, the number of files ingested per day can range from 10,000 to 100,000. As a result, the number of vertices and edges can easily reach millions to hundreds of millions of objects. This suggests that routinely including complete provenance graphs in single files may be prohibitively voluminous, although increasingly stringent requirements for provenance tracking demand that the information from which the graph can be reliably constructed be maintained. Because provenance tracking requires traversing the vertices and edges of the graph, the graph is difficult to fit uniquely into eXtensible Markup Language (XML). The graph is also difficult to construct in standard Structured Query Language (SQL), because the tables that represent it must be queried recursively. Both difficulties call for care in constructing the data structures and the software that store the metadata and make it accessible.
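The graph described above can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the class name `ProvenanceGraph`, the method names, and the file and process identifiers are all hypothetical, chosen only to show the four edge types and a breadth-first traversal that collects everything a file depends on.

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Files and processes are vertices; four maps hold the four edge types."""

    def __init__(self):
        self.produced_by = {}             # file -> the process that produced it
        self.used_by = defaultdict(set)   # file -> processes that use it
        self.inputs = defaultdict(set)    # process -> files it needs for input
        self.outputs = defaultdict(set)   # process -> files it produces

    def add_process(self, process, inputs, outputs):
        """Record one process run and all four kinds of edges it implies."""
        for f in inputs:
            self.inputs[process].add(f)
            self.used_by[f].add(process)
        for f in outputs:
            self.outputs[process].add(f)
            self.produced_by[f] = process

    def ancestors(self, file_id):
        """Return (files, processes) that file_id depends on, found breadth-first."""
        seen_files, seen_procs = set(), set()
        queue = deque([file_id])
        while queue:
            current = queue.popleft()
            proc = self.produced_by.get(current)
            if proc is None or proc in seen_procs:
                continue  # a raw input file, or a process already expanded
            seen_procs.add(proc)
            for src in self.inputs.get(proc, ()):
                if src not in seen_files:
                    seen_files.add(src)
                    queue.append(src)
        return seen_files, seen_procs


# Hypothetical three-step chain in which a statistics file feeds a retrieval.
g = ProvenanceGraph()
g.add_process("calibrate", ["L0.dat"], ["L1.dat"])
g.add_process("clear_sky_stats", ["L1.dat", "L1_prev.dat"], ["stats.dat"])
g.add_process("retrieve", ["L1.dat", "stats.dat"], ["L2.dat"])
files, procs = g.ancestors("L2.dat")
```

Storing each relation in both directions costs memory but lets the graph be walked upstream (what produced this file?) and downstream (what used it?) without scanning every edge, which matters at the vertex counts quoted above.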
The paper then discusses how this structure can be represented in metadata, including the possibilities offered by the ISO 19115 family of geospatial standards.
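The recursive-query requirement mentioned in the abstract can be sketched with a recursive common table expression, here run through Python's built-in `sqlite3` module. The two-table schema (`process_input`, `process_output`), the column names, and the sample rows are assumptions made for illustration; the abstract does not specify a schema, and database engines that lack `WITH RECURSIVE` would need a different approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE process_input  (process_id TEXT, file_id TEXT);
CREATE TABLE process_output (process_id TEXT, file_id TEXT);
INSERT INTO process_input VALUES
    ('calibrate', 'L0.dat'),
    ('clear_sky_stats', 'L1.dat'),
    ('clear_sky_stats', 'L1_prev.dat'),
    ('retrieve', 'L1.dat'),
    ('retrieve', 'stats.dat');
INSERT INTO process_output VALUES
    ('calibrate', 'L1.dat'),
    ('clear_sky_stats', 'stats.dat'),
    ('retrieve', 'L2.dat');
""")

# Seed with the direct inputs of the process that produced the target file,
# then repeatedly step from each upstream file to the inputs of its producer.
# UNION (not UNION ALL) drops duplicates, so the recursion terminates.
UPSTREAM_FILES = """
WITH RECURSIVE upstream(file_id) AS (
    SELECT i.file_id
    FROM process_output AS o
    JOIN process_input AS i ON i.process_id = o.process_id
    WHERE o.file_id = ?
  UNION
    SELECT i.file_id
    FROM upstream AS u
    JOIN process_output AS o ON o.file_id = u.file_id
    JOIN process_input AS i ON i.process_id = o.process_id
)
SELECT file_id FROM upstream;
"""


def upstream_files(connection, file_id):
    """Return the set of files that file_id depends on, directly or indirectly."""
    return {row[0] for row in connection.execute(UPSTREAM_FILES, (file_id,))}


result = upstream_files(conn, "L2.dat")
```

A single recursive query replaces one self-join per level of the processing chain, whose depth is not known in advance.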
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2007
- Bibcode: 2007AGUFMIN22A..03B
- Keywords: 0520 Data analysis: algorithms and implementation; 0525 Data management; 0530 Data presentation and visualization; 1600 GLOBAL CHANGE; 6339 System design