On the Structure of Earth Science Data Collections
Abstract
While there has been substantial work in the IT community regarding metadata and file identifier schemas, there appears to be relatively little work on the organization of the file collections that constitute the preponderance of Earth science data. One symptom of this difficulty appears in the nomenclature describing collections: the terms 'Data Product,' 'Data Set,' and 'Version' carry multiple, overlapping meanings that differ between communities. A particularly important aspect of this lack of standardization appears when the community attempts to develop a schema for data file identifiers. There are four candidate families of identifiers:
● Randomly assigned identifiers, such as GUIDs or UUIDs,
● Segmented numerical identifiers, such as OIDs or the prefixes for DOIs,
● Extensible URL-based identifiers, such as URNs, PURL, ARK, and similar schemas,
● Text-based identifiers based on citations for papers and books, such as those suggested for the International Polar Year (IPY) citations.
Unfortunately, these schema families appear to be devoid of content based on the actual structures of Earth science data collections. In this paper, we consider an organization based on an industrial production paradigm that appears to provide the preponderance of Earth science data from satellites and in situ observations. This paradigm produces a hierarchical collection structure, similar to one discussed in Barkstrom [2003: Lecture Notes in Computer Science, 2649, pp. 118-133]. In this organization, three key collection types are
● a Data Product, which is a collection of files that have similar key parameters and included data time interval,
● a Data Set, which is a collection of files within a Data Product that comes from a specified set of Data Sources,
● a Data Set Version, which is a collection of files within a Data Set for which the data producer has attempted to ensure error homogeneity.
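The three nested collection types can be sketched as a small set of Python data classes. This is only an illustration of the hierarchy described above, not a structure given in the abstract: the class names, field names, and the `time_series` helper are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DataFile:
    """One file instance (hypothetical fields)."""
    name: str
    start_jd: float  # starting time of the data in the file, as a Julian Date


@dataclass
class DataSetVersion:
    """Files for which the producer has attempted to ensure error homogeneity."""
    label: str
    files: List[DataFile] = field(default_factory=list)

    def time_series(self) -> List[DataFile]:
        # Files in a version form a time series ordered by data start time.
        return sorted(self.files, key=lambda f: f.start_jd)


@dataclass
class DataSet:
    """Files within a Data Product that come from a specified set of Data Sources."""
    data_sources: Tuple[str, ...]
    versions: List[DataSetVersion] = field(default_factory=list)


@dataclass
class DataProduct:
    """Files with similar key parameters and included data time interval."""
    key_parameters: Tuple[str, ...]
    time_interval: str  # e.g. "daily"; an assumed free-text description
    data_sets: List[DataSet] = field(default_factory=list)


# Usage: build one branch of the hierarchy with made-up names.
version = DataSetVersion(
    label="v1",
    files=[DataFile("b.dat", 2451546.5), DataFile("a.dat", 2451545.5)],
)
product = DataProduct(
    key_parameters=("radiance",),
    time_interval="daily",
    data_sets=[DataSet(data_sources=("instrument-A",), versions=[version])],
)
ordered_names = [f.name for f in version.time_series()]
```

Each level holds only a list of the level below it, so a path from Data Product to file is a sequence of positions in the hierarchy.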
Within a Data Set Version, files appear as a time series of instances that may be identified by the starting time of the data in the file. For data intended for climate uses, it seems appropriate to state this time in terms of Astronomical Julian Date, which is a long-standing international standard that provides continuity between current observations and paleo-climatic observations. Because this collection structure is hierarchical, it could be used by either of the two hierarchical identifier schema families, although it is probably easier to use with the OID/DOI family.
This hierarchical collection structure fits into the hierarchical structure of Archival Information Packages (AIPs) identified in the Open Archival Information System (OAIS) Reference Model. In that model, AIPs are subdivided into Archival Information Units (AIUs), which describe individual files, or Archival Information Collections (AICs). The latter can be hierarchically nested, leading to an OAIS RM-consistent collection structure that does not appear clearly in other metadata standards.
This paper will also discuss the connection between these collection categories and other metadata, as well as the possible need for other organizational schemas to capture the full range of Earth science data collection structures.
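As a rough illustration of identifying a file by its data start time, the sketch below converts a UTC timestamp to a Julian Date and appends it to a dotted, OID-like path. The function names, the arc layout (product, data set, version, time), and the choice of 1e-5 day resolution for the final arc are assumptions made for this example; the abstract does not specify an identifier syntax. The conversion treats UTC days as exactly 86400 seconds (leap seconds are ignored), and Python's `datetime` cannot represent the pre-year-1 dates relevant to paleo-climatic records.

```python
from datetime import datetime, timezone
from typing import Tuple

# Julian Date of the Unix epoch, 1970-01-01T00:00:00Z.
UNIX_EPOCH_JD = 2440587.5
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def julian_date(start: datetime) -> float:
    """Convert a timezone-aware datetime to a Julian Date (leap seconds ignored)."""
    days_since_epoch = (start - UNIX_EPOCH).total_seconds() / 86400.0
    return UNIX_EPOCH_JD + days_since_epoch


def file_identifier(collection_path: Tuple[int, int, int], jd: float) -> str:
    """Build a hypothetical dotted identifier: product.dataset.version.time.

    The final arc is the Julian Date in units of 1e-5 day, so that every
    arc is an integer, as in OID-style segmented identifiers.
    """
    time_arc = round(jd * 100000)
    arcs = list(collection_path) + [time_arc]
    return ".".join(str(arc) for arc in arcs)


# Usage: a file whose data start at 2000-01-01 12:00 UTC (JD 2451545.0),
# in Data Product 1, Data Set 3, Data Set Version 2 (made-up numbers).
jd = julian_date(datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc))
identifier = file_identifier((1, 3, 2), jd)
```

Because each arc corresponds to one level of the collection hierarchy, truncating the identifier from the right names the enclosing Data Set Version, Data Set, or Data Product.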
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2009
- Bibcode: 2009AGUFMIN23A1058B
- Keywords:
  - 1904 INFORMATICS / Community standards;
  - 1912 INFORMATICS / Data management, preservation, rescue;
  - 1946 INFORMATICS / Metadata