A Probabilistic Approach to Mining Massive Earth Science Data Sets
Abstract
Modern Earth science data sets are massive, complex, and unwieldy. They are difficult to manipulate, let alone mine for unknown or unexpected structural features. One approach to the first problem is to reduce the data to a manageable size in a way that preserves the statistical character of the original data, and then work with the smaller, less complex "compressed" data set. However, this does not address how to mine the data in a way that does not depend on knowing their structure in advance. In this talk we describe (1) a method of data reduction that lends itself to moderately agnostic data mining, and (2) some methods for mining the resulting compressed data to characterize large-scale data structure. Specifically, we partition a massive data set into space-time regions (e.g., monthly, five-degree grid cells) and replace the raw data in each grid cell with an estimate of a discrete, multivariate probability distribution. These distributions are the signatures of the physical processes generating the data, and data mining becomes the characterization and quantification of how those distributions evolve in time and space. Moreover, the aim is not only to characterize this evolution but to explain it physically. We use data from JPL's Atmospheric Infrared Sounder (AIRS) instrument to demonstrate the scientific utility of this approach.
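As a rough illustration of the reduction step described in the abstract, the following Python sketch assigns observations to monthly, five-degree grid cells and summarizes each cell with a normalized multivariate histogram. The histogram is a stand-in assumption: the abstract does not say how the discrete distributions are estimated. The function names (`cell_key`, `reduce_to_distributions`, `total_variation`), the bin edges, the two variables, and the toy observations are all hypothetical and are not AIRS data. Total variation distance is likewise only one assumed way to quantify how a cell's distribution changes from one month to the next.

```python
from collections import defaultdict

import numpy as np


def cell_key(lat, lon, month, cell_deg=5.0):
    """Map one observation to a (month, lat-band, lon-band) grid cell."""
    n_lat = int(round(180.0 / cell_deg))
    n_lon = int(round(360.0 / cell_deg))
    # Clamp so that lat == 90 or lon == 180 falls in the last band.
    lat_idx = min(int(np.floor((lat + 90.0) / cell_deg)), n_lat - 1)
    lon_idx = min(int(np.floor((lon + 180.0) / cell_deg)), n_lon - 1)
    return (int(month), lat_idx, lon_idx)


def reduce_to_distributions(lat, lon, month, values, bin_edges, cell_deg=5.0):
    """Replace the raw observations in each space-time cell with a discrete
    multivariate distribution (a normalized histogram over shared bins).

    NOTE: the histogram estimator is an assumption made for illustration.
    """
    values = np.asarray(values, dtype=float)
    groups = defaultdict(list)
    for i in range(len(lat)):
        groups[cell_key(lat[i], lon[i], month[i], cell_deg)].append(i)

    summaries = {}
    for key, idx in groups.items():
        counts, _ = np.histogramdd(values[idx], bins=bin_edges)
        total = counts.sum()
        if total > 0:  # skip cells whose data all fall outside the bins
            summaries[key] = counts / total
    return summaries


def total_variation(p, q):
    """Total variation distance between two distributions on the same bins."""
    return 0.5 * np.abs(p - q).sum()


# Toy observations (hypothetical): two variables per observation.
lat = np.array([2.0, 3.0, 2.5, 47.0])
lon = np.array([1.0, 4.0, 2.0, 10.0])
month = np.array([1, 1, 2, 1])
values = np.array([[250.0, 1.0], [290.0, 5.0], [250.0, 1.0], [270.0, 3.0]])

# Shared bin edges for both variables (hypothetical choices).
bin_edges = [np.array([200.0, 260.0, 320.0]), np.array([0.0, 2.0, 10.0])]

summaries = reduce_to_distributions(lat, lon, month, values, bin_edges)

# Compare the same spatial cell in month 1 and month 2.
tv = total_variation(summaries[(1, 18, 36)], summaries[(2, 18, 36)])
```

Using the same bin edges for every cell keeps the per-cell distributions directly comparable, which is what makes a distance such as `tv` meaningful across time and space.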
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2005
- Bibcode: 2005AGUFMIN33D..03B
- Keywords:
- 0300 ATMOSPHERIC COMPOSITION AND STRUCTURE;
- 0520 Data analysis: algorithms and implementation;
- 1640 Remote sensing (1855);
- 3252 Spatial analysis (0500);
- 4468 Probability distributions, heavy and fat-tailed (3265)