The catalogCleaner: Separating the Sheep from the Goats
Abstract
The Global Earth Observation Integrated Data Environment (GEO-IDE) is NOAA's effort to integrate data and information with partners in the national US-Global Earth Observation System (US-GEO) and the international Global Earth Observation System of Systems (GEOSS). As part of the GEO-IDE, the Unified Access Framework (UAF) is working to build momentum towards the goal of increased data integration and interoperability. The UAF project pursues this goal with an approach that leverages well-known and widely used standards and focuses initially on well-understood data types, such as gridded data from climate models. This phased approach engages data providers and users and has a high probability of producing demonstrable successes. The UAF project shares the widely held conviction that the use of data standards is a key ingredient for achieving interoperability. Many community-based consensus standards fail, though, due to poor compliance. Compliance problems emerge for many reasons: because the standards evolve through versions, because documentation is ambiguous, or because individual data providers find the standard inadequate as-is to meet their special needs. In addition, minimalist use of a standard can yield a service that is compliant but of low quality. For example, serving five hundred individual files from a single climate model might be compliant, but enhancing the service so that those files are all aggregated into one virtual dataset, available through a single access URL, provides a much more useful service. The UAF project began showcasing the advantages of providing compliant data by manually building a master catalog from hand-picked THREDDS servers.
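On a THREDDS Data Server, one common way to expose many files as a single virtual dataset is an NcML `joinExisting` aggregation along the time dimension. The fragment below is an illustrative sketch only; the directory location and file suffix are hypothetical, not taken from the UAF catalogs.

```xml
<!-- Hypothetical NcML aggregation: joins individual model output files
     along the time dimension so they appear as one dataset behind a
     single access URL. Paths are placeholders. -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <scan location="/data/model_output/" suffix=".nc" subdirs="false"/>
  </aggregation>
</netcdf>
```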
With the understanding that educating data managers to provide standards-compliant data and metadata can take years, the UAF project wanted to continue increasing the volume of data served through the master catalog as much as possible. However, given the sheer volume of data servers available, it quickly became obvious that the manual process of building a master catalog was not scalable. Thus the idea for the catalogCleaner tool was born. The goal of this tool is to automatically crawl a remote OPeNDAP or THREDDS server and, from the information on that server, build a "clean" catalog of data that: a) is served through uniform access services; b) has CF-compliant metadata; and c) links the data directly to common visualization tools, allowing users to immediately begin exploring the actual data. In addition, the UAF-generated clean catalog can then be used to drive data discovery tools such as Geoportal, GI-CAT, etc. This presentation further explores the motivation for creating this tool, its implementation, and the myriad challenges and difficulties that were encountered along the way.
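The kind of crawl described above can be sketched as follows. This is a minimal illustration, not the catalogCleaner's actual implementation: it parses a THREDDS `catalog.xml` document and resolves each dataset's `urlPath` against the catalog's declared OPeNDAP service. An inline sample catalog and the server name stand in for a real remote server, and a real crawler would additionally fetch the XML over HTTP and recurse into `catalogRef` elements.

```python
import xml.etree.ElementTree as ET

# THREDDS catalogs use this XML namespace (InvCatalog spec v1.0).
NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Inline sample standing in for a remote catalog.xml fetched over HTTP.
SAMPLE_CATALOG = """<?xml version="1.0"?>
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Sample">
  <service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="SST analysis" urlPath="sst/analysis.nc"/>
  <dataset name="Container only"/>
</catalog>
"""

def list_opendap_urls(catalog_xml, server="http://example.gov"):
    """Return OPeNDAP access URLs for every dataset with a urlPath."""
    root = ET.fromstring(catalog_xml)
    # Find the base path of the OPENDAP service declared in the catalog.
    base = None
    for svc in root.iter("{%s}service" % NS):
        if svc.get("serviceType", "").upper() == "OPENDAP":
            base = svc.get("base")
    # Datasets without a urlPath are containers and are skipped.
    urls = []
    for ds in root.iter("{%s}dataset" % NS):
        path = ds.get("urlPath")
        if path and base:
            urls.append(server + base + path)
    return urls

print(list_opendap_urls(SAMPLE_CATALOG))
# → ['http://example.gov/thredds/dodsC/sst/analysis.nc']
```

A tool like catalogCleaner would then test each such URL for accessibility and CF compliance before admitting it to the clean catalog.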
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2012
- Bibcode: 2012AGUFMIN44A..02O
- Keywords:
  - 1916 INFORMATICS / Data and information discovery
  - 1936 INFORMATICS / Interoperability
  - 1996 INFORMATICS / Web Services