Machine Learning Framework for the Discovery of Multivariate Associations in Disparate Biological Datasets
Abstract
Analyzing microbial data generated from complex environmental samples poses unique challenges. The multiple analytic platforms that are available each have distinct measurement capabilities, as well as differences in run-time variability and data characteristics (e.g., frequency of missing values). Additionally, high-throughput -omics technologies (e.g., genomics, proteomics) generate data at different scales of the biological system. These issues make data integration and multi-omics quantitative models especially challenging, requiring that data analysis techniques vary across data types. In particular, traditional metrics of association, such as correlation between measured values within and across datasets, are often ineffective. We present a novel machine learning framework for the discovery of multivariate associations within and across disparate biological datasets in which the quantitative nature of the data is leveraged. We have applied this methodology to an environmental microbiology study with genomics and metabolomics data, after first demonstrating the limitations of traditional association metrics. The approach involves pre-filtering of observed values, based on univariate variance and utility, followed by a random forest model conditional on experimental treatment group using a cross-validation paradigm. Finally, a novel metric of prediction efficacy and variable importance is calculated to identify and rank associations within and between biological variables from the two datasets. This framework is presented for the case of finding associations between genomics and metabolomics datasets; however, it provides the flexibility to modify the pre-filtering criteria and machine learning model to meet the requirements of the data structure.
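As a rough illustration of the pre-filtering step mentioned in the abstract, the following Python sketch drops variables that are mostly missing or nearly constant before any modeling. The abstract does not define "utility" or give thresholds, so this sketch assumes that utility means the fraction of non-missing values; the function name `prefilter`, the parameters `min_variance` and `min_observed`, and the toy data are all hypothetical and not taken from the study.

```python
from statistics import pvariance


def prefilter(features, min_variance=1e-3, min_observed=0.7):
    """Keep variables with enough observed values and enough variance.

    features: dict mapping a variable name to a list of floats,
              with None marking a missing value.
    Both thresholds are illustrative assumptions, not values from the study.
    """
    kept = {}
    for name, values in features.items():
        if not values:
            continue
        observed = [v for v in values if v is not None]
        # Assumed "utility" criterion: fraction of non-missing values.
        if len(observed) / len(values) < min_observed:
            continue
        # Univariate variance criterion: drop near-constant variables.
        if len(observed) < 2 or pvariance(observed) < min_variance:
            continue
        kept[name] = values
    return kept


# Toy, made-up measurements for three hypothetical metabolites.
metabolites = {
    "met_A": [1.0, 2.5, 0.7, 3.1, 1.9],    # varies and fully observed: kept
    "met_B": [2.0, 2.0, 2.0, 2.0, 2.0],    # constant: dropped
    "met_C": [0.4, None, None, None, 1.2], # mostly missing: dropped
}
selected = prefilter(metabolites)
print(sorted(selected))
```

In a fuller pipeline, the variables that survive this filter would be passed to the random forest and cross-validation stages, which are not sketched here.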
- Publication: AGU Fall Meeting Abstracts
- Pub Date: December 2019
- Bibcode: 2019AGUFM.B51G2320M
- Keywords:
  - 0412 Biogeochemical kinetics and reaction modeling; BIOGEOSCIENCES
  - 0414 Biogeochemical cycles, processes, and modeling; BIOGEOSCIENCES
  - 0428 Carbon cycling; BIOGEOSCIENCES
  - 0465 Microbiology: ecology, physiology and genomics; BIOGEOSCIENCES