Developing an EarthCube Governance Structure for Big Data Preservation and Access
Abstract
The underlying vision of the NSF EarthCube initiative is of an enduring resource serving the needs of the earth sciences for today and the future. We must therefore view this effort through the lens of what the earth sciences will need tomorrow and of how the underlying processes of data compilation, preservation, and access interplay with the scientific processes within the communities EarthCube will serve. Key issues that must be incorporated into the EarthCube governance structure include authentication, retrieval, and unintended use cases; the emerging role of whole-corpus data mining; and how inventory, citation, and archive practices will affect the ability of scientists to use EarthCube's collections into the future. According to the National Academies, the US federal government spends more than $140 billion a year in support of the nation's research base. Yet, a critical issue confronting all of the major scientific disciplines in building upon this investment is the lack of processes that guide how data are preserved for the long term, ensuring that studies can be replicated and that experimental data remain accessible as new analytic methods become available or theories evolve. As datasets are used years or even decades after their creation, far richer metadata is needed to describe the underlying simulation, smoothing algorithms, or bounding parameters of the data collection process. This is all the more true as data are increasingly used outside their intended disciplines, as geoscience researchers apply algorithms from one discipline to datasets from another, and those analytical techniques may make extensive assumptions about the data. As science becomes increasingly interdisciplinary and emerging computational approaches begin applying whole-corpus methodologies and blending multiple archives together, we are facing new data access modalities distinct from the needs of the past, drawing into focus the question of centralized versus distributed architectures. In the past, geoscience data have been distributed, with each site maintaining its own collections and relying on centralized inventory metadata to support discovery. This model reflected the historical search-browse-download modality, in which access primarily meant downloading a copy to a researcher's own machine and datasets were measured in gigabytes. New "big data" approaches to the geosciences are already demonstrating the need to analyze the entirety of multiple collections from multiple sites totaling hundreds of terabytes in size. Yet, datasets are outpacing the ability of networks to share them, forcing a new paradigm in high-performance computing in which computation must migrate to centralized data stores. The next generation of geoscientists will need a system designed for exploring and understanding data from multiple scientific domains and vantages, where data are preserved for decades. We are not alone in this endeavor, and there are many lessons we can learn from similar initiatives, such as more than 40 years of governance policies for data warehouses and 15 years of open web archives, all of which face the same challenges. The entire EarthCube project will fail if the new governance structure does not account for the needs of integrated cyberinfrastructure that allows big data to be stored, archived, analyzed, and made accessible to large numbers of scientists.
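The kind of provenance-rich metadata the abstract calls for can be illustrated with a minimal sketch; the record structure and field names below are hypothetical, intended only to show how simulation, smoothing, and bounding-parameter details might be captured alongside the original intended uses, and are not drawn from any EarthCube specification.

```python
# Hypothetical sketch of a provenance-rich dataset record, capturing the
# details a re-user decades later, or in another discipline, would need.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetRecord:
    identifier: str                                   # persistent identifier for citation
    title: str
    discipline: str                                   # originating earth-science domain
    collection_method: str                            # instrument, survey, or simulation
    simulation_parameters: Dict[str, float] = field(default_factory=dict)
    smoothing_algorithms: List[str] = field(default_factory=list)
    bounding_parameters: Dict[str, float] = field(default_factory=dict)
    intended_uses: List[str] = field(default_factory=list)  # documents original assumptions
    access_endpoint: str = ""                         # where computation can be moved to the data


# Example record for an imagined regional climate simulation.
record = DatasetRecord(
    identifier="doi:10.0000/example",
    title="Regional climate reanalysis (illustrative)",
    discipline="atmospheric science",
    collection_method="numerical simulation",
    simulation_parameters={"grid_resolution_km": 25.0, "timestep_s": 600.0},
    smoothing_algorithms=["gaussian_kernel"],
    bounding_parameters={"lat_min": 30.0, "lat_max": 50.0},
    intended_uses=["trend analysis"],
)
```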
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2012
- Bibcode: 2012AGUFMIN21A1465L
- Keywords: 9800 GENERAL OR MISCELLANEOUS