Evolution of data stewardship over two decades at a NASA data center
Abstract
Whether referred to as data science or data engineering, the technical nature and practice of data curation have shifted noticeably in the last two decades. Most of this shift has been driven by increasing data volumes and complexity, new data structures, and data virtualization through internet access, which have themselves spawned new fields or advances in semantic ontologies, metadata, distributed computing, and file formats. As a result of this shifting landscape, the role of the data scientist/engineer has also evolved. We will discuss the key elements of this evolutionary shift from the perspective of data curation at the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC), one of 12 NASA Earth Science data centers, which has been responsible for archiving and distributing oceanographic satellite data since 1993. Early in the PO.DAAC's history, data curation focused strictly on data archiving, low-level data quality assessments, and understanding and building read software for terse binary data or limited applications of self-describing file formats and metadata. Data discovery was often by word of mouth or based on perusing simple web pages built for specific products. At that time the PO.DAAC served only a few tens of datasets. A single data engineer focused on a specific mission or a suite of datasets for a specific physical parameter (e.g., ocean topography measurements). Since that time the number of datasets in the PO.DAAC has grown to approach one thousand, with increasing complexity of data and metadata structures in self-describing formats. Advances in ontologies, metadata, applications of MapReduce distributed computing and "big data", improvements in data discovery and data mining, and tools for visualization and analysis have all required new and evolving skill sets. The community began requiring more rigorous assessments of data quality and uncertainty.
Although expert knowledge of the physical domain remained critical, especially for assessments of data quality, additional skills in computer science, statistics, and system engineering also became necessary. Furthermore, the level of effort available to implement data curation has not expanded linearly with data growth. Managing ongoing data operations demands continually increasing productivity: under funding constraints, larger volumes of data must be managed with proportionately fewer human resources. The role of data curation has also changed from the perspective of satellite missions. In many early missions, data management and curation were an afterthought (since there were no explicit data management plans written into the proposals), while current NASA mission proposals must include explicit data management plans that identify resources and funds for archiving, distribution, and overall data stewardship. In conclusion, the role of the data scientist/engineer at the PO.DAAC has shifted from supporting single missions and primarily serving as a point of contact for the science community to complete end-to-end stewardship. This stewardship is implemented through a robust set of dataset lifecycle policies, from ingest to archiving and including data quality assessment, for a broad swath of parameter-based datasets that can number in the hundreds.
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2013
- Bibcode: 2013AGUFMIN43A1642A
- Keywords: 1999 INFORMATICS General or miscellaneous; 1912 INFORMATICS Data management, preservation, rescue