Preliminary Evaluation of MapReduce for High-Performance Climate Data Analysis
Abstract
Data-intensive analytic workflows bridge the gap between the largely unstructured mass of stored scientific data and the highly structured, tailored, reduced, and refined products that scientists use in their research. In general, the initial steps of an analysis, those operations that first interact with a data repository, tend to be the most general, while data manipulations closer to the client tend to be specialized to the individual, the domain, or the science question under study. The amount of data being operated on also tends to be larger on the repository side of the workflow and smaller toward the client-side end products. We are using MapReduce to exploit this natural stratification, optimize efficiencies along the workflow chain, and provide a preliminary qualitative and quantitative assessment of MapReduce as a means of enabling server-side, distributed climate data analysis. MapReduce is a model for distributed storage and computation that seeks to improve the efficiency of the near-archive operations that initiate workflows. Simply put, MapReduce stores chunked data on disks with associated processors in such a way that operations on the chunks can proceed in parallel and return meaningfully aggregated results. While MapReduce has proven effective for large repositories of textual data, its use in data-intensive science applications has been limited because many scientific data sets are inherently complex, have high dimensionality, and use binary formats. In our evaluation, we use Apache's open-source Hadoop implementation of MapReduce on top of the Hadoop Distributed File System (HDFS). Our analyses focus on soil moisture, precipitation, and atmospheric water vapor, important classes of observation- and simulation-derived data products. The specific data sets used in the evaluation include the MERRA monthly precipitation and soil moisture products; the MODIS Atmosphere 8-day global water vapor product; and the SMOS 3-day global soil moisture product. To develop performance metrics, we focus on a small set of canonical near-archive, early-stage analytical operations that represent a common starting point in analysis workflows across many domains: for example, avg, min, max, and stddev computed over a given temporal and spatial extent. In this talk, we describe our evaluation of critical MapReduce attributes based on these experiments, including the data preparation, ingest, and space complexities of creating repositories of the canonical data sets, and the time and space complexities of server-side processing of the canonical operations along scaled ranges of their spatiotemporal parameters.
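As a concrete illustration of how one canonical operation maps onto Hadoop, the sketch below implements a per-grid-cell average as a MapReduce job. This is not the authors' code: it assumes the binary products have already been flattened into text records of the form `time,lat,lon,value`, and the record layout, class names (`CellAverage`, `CellMapper`, `AvgReducer`), and input/output paths are all hypothetical.

```java
// Minimal sketch: compute the average value per (lat,lon) grid cell
// with Hadoop MapReduce, assuming text records "time,lat,lon,value".
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CellAverage {

    // Map: parse one record and emit (lat,lon) -> value.
    public static class CellMapper
            extends Mapper<Object, Text, Text, DoubleWritable> {
        @Override
        protected void map(Object key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",");   // time,lat,lon,value
            ctx.write(new Text(f[1] + "," + f[2]),
                      new DoubleWritable(Double.parseDouble(f[3])));
        }
    }

    // Reduce: average all values observed for one grid cell.
    public static class AvgReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text cell, Iterable<DoubleWritable> vals,
                              Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            long n = 0;
            for (DoubleWritable v : vals) { sum += v.get(); n++; }
            ctx.write(cell, new DoubleWritable(sum / n));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cell average");
        job.setJarByClass(CellAverage.class);
        job.setMapperClass(CellMapper.class);
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

One general design note: min and max are associative, so their reducer can also serve as a combiner to cut shuffle traffic, whereas avg and stddev require carrying partial aggregates such as (sum, count) or (sum, sum of squares, count) between combine and reduce stages.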
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2011
- Bibcode: 2011AGUFMIN44A..08D
- Keywords:
  - 1916 INFORMATICS / Data and information discovery;
  - 1920 INFORMATICS / Emerging informatics technologies;
  - 1932 INFORMATICS / High-performance computing