Cross-study analyses of microbial abundance using generalized common factor methods

doi:10.48550/arXiv.2303.15211

By creating networks of biochemical pathways, communities of micro-organisms are able to modulate the properties of their environment and even the metabolic processes within their hosts. Next-generation high-throughput sequencing has opened a new frontier in microbial ecology, promising the ability to leverage the microbiome to make crucial advancements in the environmental and biomedical sciences. Realizing this promise is challenging, however, because genomic data are high-dimensional, sparse, and noisy. Much of this noise reflects the exact conditions under which sequencing took place, and it is large enough to limit consensus-based validation of study results. We propose an ensemble approach for cross-study exploratory analyses of microbial abundance data in which we first estimate the variance-covariance matrix of the underlying abundances from each dataset on the log scale, assuming Poisson sampling, and subsequently model these covariances jointly so as to find a shared low-dimensional subspace of the feature space. By viewing the projection of the latent true abundances onto this common structure, the variation is pared down to that which is shared among all datasets, and which is likely to reflect more generalizable biological signal than can be inferred from individual datasets. We investigate several ways of estimating this shared structure, and demonstrate that they work well on simulated and real metagenomic data in terms of signal retention and interpretability.
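The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the paper's generalized common factor method: the covariance step uses a standard moment estimator for a Poisson log-normal model, and the shared-subspace step is a simple stand-in that takes the leading eigenvectors of the averaged covariance. The function names (`poisson_lognormal_cov`, `shared_subspace`), the simulated loadings, and the clipping threshold are all assumptions made for illustration.

```python
import numpy as np


def poisson_lognormal_cov(counts):
    """Moment estimate of the covariance of latent log-abundances.

    Assumes X_ij | Z ~ Poisson(exp(Z_ij)) with Z ~ N(mu, Sigma), so that
    E[X_j X_k] = m_j m_k exp(Sigma_jk) for j != k and
    E[X_j (X_j - 1)] = m_j^2 exp(Sigma_jj), where m_j = E[X_j].
    Every taxon is assumed to have a non-zero sample mean.
    """
    x = np.asarray(counts, dtype=float)
    n = x.shape[0]
    mean = x.mean(axis=0)
    second = x.T @ x / n
    # Use the factorial moment on the diagonal to remove Poisson noise.
    second[np.diag_indices_from(second)] -= mean
    ratio = second / np.outer(mean, mean)
    # Ad hoc guard: sparse taxa can yield non-positive ratios.
    return np.log(np.clip(ratio, 1e-8, None))


def shared_subspace(covs, k):
    """Stand-in for joint modeling: top-k eigenvectors of the mean covariance."""
    pooled = np.mean(covs, axis=0)
    pooled = (pooled + pooled.T) / 2
    vals, vecs = np.linalg.eigh(pooled)
    return vecs[:, np.argsort(vals)[::-1][:k]]


# Hypothetical simulation: two studies share loadings but differ in noise level.
rng = np.random.default_rng(0)
p, k, n = 6, 2, 20000
loadings = np.array([[0.6, 0.0], [0.5, 0.1], [0.4, -0.3],
                     [0.0, 0.6], [-0.2, 0.5], [0.1, 0.4]])


def simulate(noise_sd):
    sigma = loadings @ loadings.T + noise_sd**2 * np.eye(p)
    z = rng.multivariate_normal(np.full(p, 2.0), sigma, size=n)
    return rng.poisson(np.exp(z)), sigma


counts_a, sigma_a = simulate(0.1)
counts_b, sigma_b = simulate(0.2)
cov_a = poisson_lognormal_cov(counts_a)
cov_b = poisson_lognormal_cov(counts_b)
basis = shared_subspace([cov_a, cov_b], k)

# Compare the estimated subspace with the span of the true loadings.
q, _ = np.linalg.qr(loadings)
proj_error = np.linalg.norm(basis @ basis.T - q @ q.T)
print("max covariance error (study A):", np.max(np.abs(cov_a - sigma_a)))
print("projection error:", proj_error)
```

Averaging the covariances treats every study equally and assumes the shared structure dominates each one; a joint factor model of the kind the abstract describes would instead separate shared from study-specific variation explicitly.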

Publication:

arXiv e-prints

Pub Date:

March 2023

DOI:

10.48550/arXiv.2303.15211

arXiv:

arXiv:2303.15211

Bibcode:

2023arXiv230315211H

Keywords:

Statistics - Applications;
62P10

NASA/ADS
