End-To-End Metadata Curation Workflows for Improved Data Discovery: Experience of the Earthcube Data Discovery Hub Project
Abstract
Automated enhancement of metadata content has been widely touted as a solution for improving metadata quality, particularly for datasets. As data volumes and the variety of data types increase, many data providers lack sufficient curatorial resources to update and improve the metadata in their catalogs on an ongoing basis.
The EarthCube Data Discovery Hub (DDH) project has been integrating enhancement processes into its metadata aggregation workflow. The motivation is the heterogeneity of metadata exposed for harvesting and indexing, even when it is encoded in standard metadata interchange formats such as CSDGM, ISO 19139, or DataCite XML. Existing enhancement processes include keyword addition, spatial georeferencing, and association of URIs with organizations; new processes are under development to extract entity and attribute information. The discovery hub also enables community registration of new resources, as well as edits to individual records or groups of records. The project presents a new metadata curation model in which records are initially generated and enhanced by an extensible metadata augmentation pipeline, while curators assess and tune the enhancement process by validating machine-generated metadata elements, applying the enhancements to records within a selected scope, and re-running the pipeline. In this environment of dynamic update and continuous improvement, metadata may be updated in different locations, which makes it difficult to keep descriptions of resources from various providers in sync and for users to determine which metadata is most reliable. When a provider updates harvested metadata so that it becomes inconsistent with the enhanced metadata in the DDH aggregator, the resulting conflicts must currently be resolved manually by curators. The DDH aggregator records a provenance trace for updates applied to harvested records. These traces are accessible through catalog search results to help users assess the reliability of the metadata. In the long run, our goal is to feed these update traces back to the original providers for them to evaluate and apply as they see fit, creating a feedback loop that improves and synchronizes metadata in the federated catalog system.
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2018
- Bibcode:
- 2018AGUFMIN51A..06R
- Keywords:
- 1916 Data and information discovery; INFORMATICS
- 1930 Data and information governance; INFORMATICS
- 1946 Metadata; INFORMATICS
- 1976 Software tools and services; INFORMATICS