Tackling the 2nd V: Big Data, Variety and the Need for Representation Consistency
Abstract
While Big Data technologies are transforming our ability to analyze ever larger volumes of Earth science data, practical constraints continue to limit our ability to compare data across datasets from different sources in an efficient and robust manner. Within a single data collection, invariants such as file format, grid type, and spatial resolution greatly simplify many types of analysis (often implicitly). However, when analysis combines data across multiple data collections, researchers are generally required to implement data transformations (i.e., "data preparation") to provide appropriate invariants. These transformation include changing of file formats, ingesting into a database, and/or regridding to a common spatial representation, and they can either be performed once, statically, or each time the data is accessed. At the very least, this process is inefficient from the perspective of the community as each team selects its own representation and privately implements the appropriate transformations. No doubt there are disadvantages to any "universal" representation, but we posit that major benefits would be obtained if a suitably flexible spatial representation could be standardized along with tools for transforming to/from that representation. We regard this as part of the historic trend in data publishing. Early datasets used ad hoc formats and lacked metadata. As better tools evolved, published data began to use standardized formats (e.g., HDF and netCDF) with attached metadata. We propose that the modern need to perform analysis across data sets should drive a new generation of tools that support a standardized spatial representation. More specifically, we propose the hierarchical triangular mesh (HTM) as a suitable "generic" resolution that permits standard transformations to/from native representations in use today, as well as tools to convert/regrid existing datasets onto that representation.
- Publication:
-
AGU Fall Meeting Abstracts
- Pub Date:
- December 2016
- Bibcode:
- 2016AGUFMIN11B1616C
- Keywords:
-
- 1914 Data mining;
- INFORMATICSDE: 1932 High-performance computing;
- INFORMATICSDE: 1942 Machine learning;
- INFORMATICSDE: 1980 Spatial analysis and representation;
- INFORMATICS