Enhancing NASA Earth Science Data Discovery from Scientific Publications
Abstract
Earth observations from spaceborne instruments have grown explosively in the past decades. Following closely are reanalysis systems that assimilate model and observational data, yielding even longer records and a larger number of variables. Thanks to advances in internet technology, it is now easier than ever to visualize and analyze these data using web interfaces. On the other hand, it has also become an increasingly daunting task to build upon the existing knowledge published in various peer-reviewed sources and to navigate toward the most relevant data, analysis, and visualization.
We present an analysis of a subset of publications that utilized a popular visualization web interface at the NASA Goddard Earth Science Data and Information Services Center. Known as "Giovanni", it allows researchers from a wide range of backgrounds to work with hundreds of variables from space observations and assimilation systems. Since coming online more than a decade ago, Giovanni has been credited in more than 100 papers per year, and the total count is now estimated to be nearly 1,500. Many of these papers contain valuable information about when, where, and how Giovanni has been used, and hence offer an opportunity to learn and share the knowledge of which variables were used in which research projects. The purpose of our work is to retrieve this information from the papers and organize it as a knowledge repository that links together datasets, variables, places, dates, and phenomena, all of which reflect the essence of the published research. Because the publications are unstructured text, we use natural language processing along with machine learning methods in the retrieval process. One of the challenges is deciphering the dataset names, because in many cases researchers refer to variables rather than to the datasets containing them. To constrain the number of terms, we deploy Earth Science ontologies as dictionaries for the term extraction. We demonstrate that storing these terms and the underlying ontologies, along with datasets, variables, and papers, in a knowledge graph database enables various linkages among these entities, facilitating data discovery. This sets a qualitatively new stage in the improvement of web data interfaces, in which machine learning techniques are used to establish and optimize usage-based discovery of data.
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2020
- Bibcode:
- 2020AGUFMIN0140005G
- Keywords:
- 1920 Emerging informatics technologies; INFORMATICS;
- 1938 Knowledge representation and knowledge bases; INFORMATICS;
- 1954 Natural language processing; INFORMATICS;
- 1958 Ontologies; INFORMATICS
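
The abstract describes using Earth Science ontologies as dictionaries for term extraction and then linking papers, variables, places, and phenomena in a knowledge graph. The following is a minimal sketch of that idea, not the authors' implementation: the term dictionary, the paper texts, and the function names `extract_terms` and `build_graph` are all hypothetical, and the matching is a plain case-insensitive whole-word lookup rather than the natural language processing and machine learning methods mentioned in the abstract. The "graph" here is just a pair of in-memory mappings standing in for a knowledge graph database.

```python
import re
from collections import defaultdict

# Hypothetical mini-dictionary standing in for Earth Science ontology terms.
# Each term is tagged with an illustrative entity type.
ONTOLOGY_TERMS = {
    "aerosol optical depth": "variable",
    "precipitation": "variable",
    "sea surface temperature": "variable",
    "dust storm": "phenomenon",
    "sahara": "place",
}


def extract_terms(text, dictionary):
    """Return the dictionary terms that occur in text as whole words (case-insensitive)."""
    found = {}
    lowered = text.lower()
    for term, kind in dictionary.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found[term] = kind
    return found


def build_graph(papers, dictionary):
    """Link each paper to its extracted terms, and each term back to the papers using it."""
    paper_to_terms = {}
    term_to_papers = defaultdict(set)
    for paper_id, text in papers.items():
        terms = extract_terms(text, dictionary)
        paper_to_terms[paper_id] = terms
        for term in terms:
            term_to_papers[term].add(paper_id)
    return paper_to_terms, term_to_papers


# Made-up example sentences; not quotations from any real publication.
PAPERS = {
    "paper-A": "We used Giovanni to map aerosol optical depth during a dust storm over the Sahara.",
    "paper-B": "Monthly precipitation and aerosol optical depth were compared.",
}

paper_to_terms, term_to_papers = build_graph(PAPERS, ONTOLOGY_TERMS)

# Usage-based discovery: which papers used a given variable?
print(sorted(term_to_papers["aerosol optical depth"]))
```

Restricting extraction to a fixed dictionary keeps the set of candidate terms small, which is the role the abstract assigns to the ontologies. Resolving a variable name to the dataset that contains it, the challenge noted in the abstract, is not attempted in this sketch.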