The SeaView EarthCube project: Lessons Learned from Integrating Across Repositories
Abstract
SeaView is an NSF-funded EarthCube Integrative Activity Project working with 5 existing data repositories* to provide oceanographers with highly integrated thematic data collections in user-requested formats. The project has three complementary goals:
Supporting Scientists: SeaView targets scientists' need for easy access to data of interest that are ready to import into their preferred tool. Strengthening Repositories: By integrating data from multiple repositories for science use, SeaView is helping the ocean data repositories align their data and processes and make ocean data more accessible and easily integrated. Informing EarthCube (earthcube.org): SeaView's experience as an integration demonstration can inform the larger NSF EarthCube architecture and design effort. The challenges faced in this small-scale effort are informative to geosciences cyberinfrastructure more generally. Here we focus on the lessons learned that may inform other data facilities and integrative architecture projects. (The SeaView data collections will be presented at the Ocean Sciences 2018 meeting.) One example is the importance of shared semantics, with persistent identifiers, for key integration elements across the data sets (e.g. cruise, parameter, and project/program.) These must allow for revision through time and should have an agreed authority or process for resolving conflicts: aligning identifiers and correcting errors were time consuming and often required both deep domain knowledge and "back end" knowledge of the data facilities. Another example is the need for robust provenance, and tools that support automated or semi-automated data transform pipelines that capture provenance. Multiple copies and versions of data are now flowing into repositories, and onward to long-term archives such as NOAA NCEI and umbrella portals such as DataONE. Exact copies can be identified with hashes (for those that have the skills), but it can be painfully difficult to understand the processing or format changes that differentiates versions. As more sensors are deployed, and data re-use increases, this will only become more challenging. We will discuss these, and additional lessons learned, as well as invite discussion and solutions from others doing similar work. * BCO-DMO, CCHDO, OBIS, OOI, R2R- Publication:
-
AGU Fall Meeting Abstracts
- Pub Date:
- December 2017
- Bibcode:
- 2017AGUFMIN33C0140D
- Keywords:
-
- 1916 Data and information discovery;
- INFORMATICS;
- 1920 Emerging informatics technologies;
- INFORMATICS;
- 1982 Standards;
- INFORMATICS;
- 1998 Workflow;
- INFORMATICS