Globus Online: Climate Data Management for Small Teams
Abstract
Large and highly distributed climate data demands new approaches to data organization and lifecycle management. We need, in particular, catalogs that allow researchers to track the location and properties of large numbers of data files, and management tools that allow researchers to update data properties and organization during their research, move data among different locations, and invoke analysis computations on data, all as easily as if they were working with small numbers of files on their desktop computer. Both catalogs and management tools often need to scale to extremely large quantities of data. When developing solutions to these problems, it is important to distinguish between the needs of (a) large communities, for whom the ability to organize published data is crucial (e.g., by implementing formal data publication processes, assigning DOIs, recording definitive metadata, providing for versioning), and (b) individual researchers and small teams, who are more frequently concerned with tracking the diverse data and computations involved in what are highly dynamic and iterative research processes. Key requirements in the latter case include automated data registration and metadata extraction; ease of update; close-to-zero management overheads (e.g., no local software install); and flexible, user-managed sharing support, allowing read and write privileges within small groups.
We describe here how new capabilities provided by the Globus Online system address the needs of the latter group of climate scientists, providing for the rapid creation and establishment of lightweight individual- or team-specific catalogs; the definition of logical groupings of data elements, called datasets; the evolution of catalogs, dataset definitions, and associated metadata over time, to track changes in data properties and organization as a result of research processes; and the manipulation of data referenced by catalog entries (e.g., replication of a dataset to a remote location for analysis, sharing of a dataset). Its software-as-a-service ('SaaS') architecture means that these capabilities are provided to users over the network, without a need for local software installation. In addition, Globus Online provides well-defined APIs, and thus a platform that can be leveraged to integrate these capabilities with other portals and applications. We describe early applications of these new Globus Online capabilities to climate science. We focus in particular on applications that demonstrate how Globus Online capabilities complement those of the Earth System Grid Federation (ESGF), the premier system for publication and discovery of large community datasets. ESGF already uses Globus Online mechanisms for data download. We demonstrate methods by which the two systems can be further integrated and harmonized, so that, for example, data collections produced within a small team can be easily published from Globus Online to ESGF for archival storage and broader access, and a Globus Online catalog can be used to organize an individual view of a subset of data held in ESGF.
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2013
- Bibcode: 2013AGUFMIN21B1396A
- Keywords: 1908 INFORMATICS Cyberinfrastructure; 1912 INFORMATICS Data management, preservation, rescue; 1976 INFORMATICS Software tools and services