ONEMercury: Towards Automatic Annotation of Earth Science Metadata
Abstract
Earth sciences have become more data-intensive, requiring access to heterogeneous data collected from multiple places, times, and thematic scales. For example, research on climate change may involve exploring and analyzing observational data such as the migration of animals and temperature shifts across the earth, as well as various model-observation inter-comparison studies. DataONE, a federated data network, was recently established to facilitate access to and preservation of environmental and ecological data. ONEMercury has been implemented as part of the DataONE project to serve as a portal for discovering and accessing environmental and observational data across the globe. ONEMercury harvests metadata from the data hosted by multiple data repositories and makes it searchable via a common search interface built upon cutting-edge search engine technology, allowing users to interact with the system, intelligently filter the search results on the fly, and fetch the data from distributed data sources. Linking data from heterogeneous sources always has a cost. One problem that ONEMercury faces is that the harvested metadata records are annotated to different levels of detail. Poorly annotated records tend to be missed during the search process because they lack meaningful keywords. Furthermore, such records are not compatible with the advanced search functionality offered by ONEMercury, as the interface requires that a metadata record be semantically annotated. The rapid growth in the number of metadata records harvested from an increasing number of data repositories makes it impossible to annotate the harvested records manually, creating the need for a tool capable of automatically annotating poorly curated metadata records. In this paper, we propose a topic-model (TM) based approach for automatic metadata annotation. Our approach mines topics in the set of well-annotated records and suggests keywords for poorly annotated records based on topic similarity.
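The idea of suggesting keywords by topic similarity can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a topic distribution has already been inferred for every record (for example by a topic model such as LDA), and the record format, the function names `cosine` and `suggest_keywords`, the similarity-weighted voting scheme, and the sample data are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def suggest_keywords(query_topics, annotated, top_k=3):
    """Rank keywords of annotated records by summed topic similarity.

    query_topics: topic distribution of the poorly annotated record.
    annotated: list of (topic_distribution, keywords) pairs
               (hypothetical format, assumed for this sketch).
    """
    scores = defaultdict(float)
    for topics, keywords in annotated:
        sim = cosine(query_topics, topics)
        for kw in keywords:
            # Each similar record "votes" for its keywords.
            scores[kw] += sim
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [kw for kw, _ in ranked[:top_k]]


# Made-up topic distributions and keywords, for illustration only.
annotated_records = [
    ([0.8, 0.1, 0.1], ["soil moisture", "carbon flux"]),
    ([0.7, 0.2, 0.1], ["carbon flux"]),
    ([0.1, 0.1, 0.8], ["phylogeny"]),
]
suggestions = suggest_keywords([0.75, 0.15, 0.10], annotated_records, top_k=2)
print(suggestions)
```

Keywords attached to records whose topic mixture resembles that of the query record accumulate the highest scores, so they appear first in the suggestion list.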
We use the tag prediction protocol to evaluate our method against a TFIDF-based baseline, using 10-fold cross-validation with 5 evaluation metrics: Precision, Recall, F1 score, Mean Reciprocal Rank, and Binary Preference. These evaluation metrics are widely used together to evaluate recommendation systems. Our experiments on four sets of metadata records harvested from the ORNL DAAC, Dryad Digital Repository, Knowledge Network for Biocomplexity, and TreeBASE (a repository of phylogenetic information) show that our TM-based approach significantly outperforms the baseline.
Figure: Evaluation results (Precision-vs-Recall plots) for the Topic-Model approach over the TFIDF method for each repository
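As an illustration of how some of these metrics are computed for ranked keyword suggestions, the sketch below implements Precision, Recall, F1, and Mean Reciprocal Rank using their standard definitions. The function names and sample keywords are hypothetical, suggestion lists are assumed to contain no duplicates, and Binary Preference and the cross-validation loop are omitted for brevity.

```python
def precision_recall_f1(suggested, relevant):
    """Precision, recall, and F1 of a suggestion list against true keywords."""
    hits = len(set(suggested) & set(relevant))
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if hits == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def reciprocal_rank(suggested, relevant):
    """1 / rank of the first correct suggestion, or 0 if none is correct."""
    relevant = set(relevant)
    for rank, keyword in enumerate(suggested, start=1):
        if keyword in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (suggested, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(s, r) for s, r in runs) / len(runs)


# Made-up example: four ranked suggestions, three true keywords.
suggested = ["climate", "soil", "carbon", "birds"]
relevant = ["soil", "carbon", "nitrogen"]
p, r, f1 = precision_recall_f1(suggested, relevant)
mrr = mean_reciprocal_rank([(suggested, relevant), (["soil"], ["soil"])])
print(p, r, f1, mrr)
```

In a tag prediction setting, the `relevant` list would hold the keywords held out from a well-annotated record and `suggested` the keywords proposed for it by the method under evaluation.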
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2012
- Bibcode: 2012AGUFMIN41A1482T
- Keywords:
  - 1906 INFORMATICS / Computational models, algorithms;
  - 1916 INFORMATICS / Data and information discovery;
  - 1920 INFORMATICS / Emerging informatics technologies;
  - 1996 INFORMATICS / Web Services