Eliminating Barriers to Climate Data with Columnar Storage Formats and Cloud Computing
Abstract
Public and private sector institutions are increasingly interested in applying modern "big data" techniques to discover critical information in climate and meteorological datasets. Such datasets are often archived by national and international efforts in the NetCDF file format. Despite its dominance in atmospheric and climate science, this format is limited by inherent challenges in network bandwidth, portability, and accessibility. Furthermore, it is not widely recognized by data scientists outside the relatively narrow fields of climate and atmospheric science, physical oceanography, and climate modeling. Here we present a novel approach to storing and querying large climate and other geospatial datasets in columnar storage formats (ORC). This approach allows standard, widely used tools to query, extract, and process large (>10 TB) datasets. We further illustrate the usefulness of this approach using Amazon Athena, which can run complex analyses or SQL queries against massive datasets in parallel. In addition, parallelized access to the data is remarkably simple, enabling real-time queries of petabyte-scale data in seconds to minutes, as opposed to the hours to days that would be required with NetCDF. This approach also eliminates the need to transfer data before processing, simultaneously cutting computational costs and accelerating user access to climate data. Moreover, data scientists from other fields can immediately understand how to access and work with ORC data using common tools such as Amazon Athena, Apache Spark, and ODBC/JDBC-driven analysis software. Finally, SQL is straightforward for new users to learn, and hence removes the barrier to entry that formats like NetCDF pose for those inexperienced with programming.
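As a rough illustration of the kind of SQL query the abstract has in mind, the sketch below flattens a few gridded values into rows and aggregates them by day. It is not the authors' implementation: an in-memory SQLite table stands in for ORC data queried through Amazon Athena, and the table name `surface_temperature`, its columns, the sample values, and the helper `mean_temperature_by_day` are all hypothetical.

```python
import sqlite3

# Hypothetical flattened layout: one row per (date, lat, lon) grid cell.
# In the setting described above these columns would be stored as ORC files
# and queried with Athena; SQLite is used here only as a local stand-in.
rows = [
    ("2018-07-01", 40.0, -105.0, 301.2),
    ("2018-07-01", 40.0, -104.0, 300.8),
    ("2018-07-02", 40.0, -105.0, 303.0),
    ("2018-07-02", 40.0, -104.0, 302.0),
    ("2018-07-02", 10.0, -105.0, 299.0),
]


def mean_temperature_by_day(rows, lat_min, lat_max):
    """Return (date, mean temperature in K) for cells within a latitude band."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE surface_temperature "
        "(obs_date TEXT, lat REAL, lon REAL, temp_k REAL)"
    )
    conn.executemany(
        "INSERT INTO surface_temperature VALUES (?, ?, ?, ?)", rows
    )
    # Plain SQL: filter by latitude, then average per day.
    query = """
        SELECT obs_date, AVG(temp_k) AS mean_temp_k
        FROM surface_temperature
        WHERE lat BETWEEN ? AND ?
        GROUP BY obs_date
        ORDER BY obs_date
    """
    result = conn.execute(query, (lat_min, lat_max)).fetchall()
    conn.close()
    return result


daily_means = mean_temperature_by_day(rows, 30.0, 50.0)
for obs_date, mean_temp_k in daily_means:
    print(obs_date, round(mean_temp_k, 2))
```

The query touches only three of the four columns (`obs_date`, `lat`, `temp_k`). A columnar format can read just those columns from storage, which is one reason such queries can remain fast on very large tables.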
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2018
- Bibcode: 2018AGUFMIN31C0829S
- Keywords:
- 3305 Climate change and variability; ATMOSPHERIC PROCESSES
- 0414 Biogeochemical cycles, processes, and modeling; BIOGEOSCIENCES
- 1942 Machine learning; INFORMATICS
- 1986 Statistical methods: Inferential; INFORMATICS