Entity Linking Leveraging the GeoDeepDive Cyberinfrastructure and Managing Uncertainty with Provenance.
Abstract
Unlocking the full, rich network of links between the scientific literature and the real-world entities to which data correspond, such as field expeditions (cruises) on oceanographic research vessels and the physical samples collected during those expeditions, remains a challenge for the geoscience community. Doing so would enable data reuse and integration on a broad scale, making it possible to inspect the network and discover, for example, all rock samples reported in the scientific literature that were found within 10 kilometers of an undersea volcano, together with their associated geochemical analyses. Such a capability could facilitate new scientific discoveries. The GeoDeepDive project provides negotiated access to more than 4.2 million documents from scientific publishers, enabling text and document mining via a public API and cyberinfrastructure. We mined this corpus using entity linking techniques, which are inherently uncertain, and recorded provenance information about each link. Recording provenance opens the entity linking methodology to scrutiny and enables downstream applications to make informed assessments about whether an entity link is suitable for consumption. A major challenge is how to model and disseminate the provenance information. We present results from entity linking between journal articles, research vessels and cruises, and physical samples from the Petrological Database (PetDB), and we incorporate Linked Data resources, such as cruises in the Rolling Deck to Repository (R2R) catalog, where possible. Our work demonstrates the value and potential of the GeoDeepDive cyberinfrastructure in combination with the Linked Data infrastructure provided by the EarthCube GeoLink project. We present a research workflow for capturing provenance information that leverages the World Wide Web Consortium (W3C) recommendation PROV Ontology (PROV-O).
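To make the idea of recording provenance for an uncertain entity link concrete, the following is a minimal Python sketch, not the authors' workflow. It describes one link as a JSON-LD-style dictionary using terms from the W3C PROV Ontology (`prov:Entity`, `prov:Activity`, `prov:wasGeneratedBy`, `prov:used`, `prov:wasDerivedFrom`). Everything else is an assumption made for illustration: the `ex:` namespace, the `ex:confidence` and `ex:matchedText` properties, the identifiers, and the function names `link_with_provenance` and `is_usable` are hypothetical and do not come from GeoDeepDive, PetDB, R2R, or GeoLink.

```python
import json
from datetime import datetime, timezone

# Real W3C PROV-O namespace; every "ex:" term below is hypothetical.
PROV = "http://www.w3.org/ns/prov#"


def link_with_provenance(article_id, target_id, method, confidence, matched_text):
    """Describe one uncertain entity link and the activity that produced it."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return {
        "@context": {"prov": PROV, "ex": "urn:example:"},
        "@id": f"{article_id}--{target_id}",
        "@type": "prov:Entity",
        "ex:source": article_id,          # the journal article
        "ex:target": target_id,           # e.g. a cruise or a physical sample
        "ex:confidence": confidence,      # assumed score in [0, 1]
        "ex:matchedText": matched_text,   # text span that triggered the link
        "prov:wasDerivedFrom": {"@id": article_id},
        "prov:wasGeneratedBy": {
            "@id": f"ex:linking-run-{method}",
            "@type": "prov:Activity",
            "prov:used": {"@id": article_id},
            "prov:endedAtTime": datetime.now(timezone.utc).isoformat(),
        },
    }


def is_usable(link, threshold):
    """Let a downstream application decide whether to consume a link."""
    return link["ex:confidence"] >= threshold


if __name__ == "__main__":
    # Hypothetical identifiers, chosen only for this example.
    link = link_with_provenance(
        "ex:article-1", "ex:cruise-A", "string-match", 0.8, "cruise A"
    )
    print(json.dumps(link, indent=2))
    print(is_usable(link, threshold=0.5))
```

Keeping the generating activity attached to each link is what lets a consumer apply its own threshold or exclude links produced by a method it does not trust, rather than accepting every link as equally reliable.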
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2017
- Bibcode: 2017AGUFMIN33B0118M
- Keywords:
  - 1916 Data and information discovery (INFORMATICS)
  - 1930 Data and information governance (INFORMATICS)
  - 1970 Semantic web and semantic integration (INFORMATICS)
  - 1986 Statistical methods: Inferential (INFORMATICS)