The Census of Deep Life: Metadata Then and Now
Abstract
The Census of Deep Life (CoDL), a central pillar of the Deep Carbon Observatory (DCO), aims to investigate the diversity, distribution and biogeography of the subsurface biosphere. Over two dozen marine and terrestrial subsurface habitats were surveyed, yielding over three hundred samples. The associated metadata have the potential to identify the environmental features that drive microbial diversity within the subsurface. An analysis of the CoDL datasets revealed metadata deficiencies that limit the exploration of potential correlations among microbial and environmental features. These deficiencies included sparse measurement and/or reporting of metadata, missing features, inconsistent units, and mutually exclusive features. Working as an interdisciplinary team, we established a workflow to improve the quality of the CoDL metadata and further enhance the analysis of globally surveyed subsurface ecosystems.
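Mutually exclusive features, one of the deficiencies listed above, can be illustrated with depth-like measurements: a marine sample may report a water depth and a depth below seafloor, while a terrestrial sample reports only a depth below the land surface. The following is a minimal sketch of how such features might be collapsed into one inclusive pressure feature. The parameter names, the fluid densities, and the simple hydrostatic approximation are all assumptions made for illustration; none of them comes from the abstract, and elevation is not handled here.

```python
from typing import Optional

G = 9.80665            # standard gravity, m/s^2
P_ATM_MPA = 0.101325   # one standard atmosphere, MPa
RHO_SEAWATER = 1025.0  # assumed mean seawater density, kg/m^3
RHO_FRESH = 1000.0     # assumed groundwater density, kg/m^3


def hydrostatic_pressure_mpa(
    water_depth_m: Optional[float] = None,
    depth_below_seafloor_m: Optional[float] = None,
    depth_below_surface_m: Optional[float] = None,
) -> Optional[float]:
    """Estimate in situ pressure (MPa) from whichever depth feature is reported.

    Hypothetical helper: the feature names and the hydrostatic model are
    illustrative assumptions, not the CoDL workflow itself.
    """
    if water_depth_m is not None:
        # Marine sample: water column plus any depth into the sediment,
        # treating the pore fluid as seawater (a simplification).
        column_m = water_depth_m + (depth_below_seafloor_m or 0.0)
        rho = RHO_SEAWATER
    elif depth_below_surface_m is not None:
        # Terrestrial sample: assume a water-filled column from the surface.
        column_m = depth_below_surface_m
        rho = RHO_FRESH
    else:
        # No depth-like feature reported; leave pressure missing.
        return None
    # Pa -> MPa conversion, plus atmospheric pressure at the top of the column.
    return P_ATM_MPA + rho * G * column_m / 1e6


if __name__ == "__main__":
    print(hydrostatic_pressure_mpa(water_depth_m=1000.0, depth_below_seafloor_m=50.0))
    print(hydrostatic_pressure_mpa(depth_below_surface_m=100.0))
    print(hydrostatic_pressure_mpa())
```

Returning `None` when no depth-like feature is present keeps the derived feature honest about what is still missing, so that any later imputation step can treat it like other gaps.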
The workflow included (1) manual inference and accession of metadata from project PIs; (2) quality improvements in which unit mismatches were identified with outlier analysis; (3) computation of inclusive features from mutually exclusive features; and (4) data imputation using visual analytics techniques. For example, a more inclusive feature, the in situ pressure, was calculated directly by combining several mutually exclusive "depth" and "elevation" features, and a preliminary comparison showed that the computed pressures agreed with the few reported values within a small margin. Further, imputation of missing values is being facilitated by correlation, correlogram, and distribution visualizations created from the existing metadata. Initial analyses reveal a bifurcation between marine and terrestrial samples, and suggest that some related features (e.g. salinity, conductivity, sodium, chloride) may be imputed from other features. Such imputation could lead to an expanded and more complete set of metadata features. We also attempted to assess the status of the metadata with the Synthetic Minority Over-sampling Technique (SMOTE) and with clustering, but without success. As we work to augment the CoDL metadata, we share our methods and experiences to help the wider community enrich its own existing metadata.
- Publication:
- AGU Fall Meeting Abstracts
- Pub Date:
- December 2018
- Bibcode:
- 2018AGUFMIN53C0629R
- Keywords:
- 1916 Data and information discovery, INFORMATICS;
- 1930 Data and information governance, INFORMATICS;
- 1946 Metadata, INFORMATICS;
- 1976 Software tools and services, INFORMATICS