Revealing the Shape of Genome Space via K-mer Topology
Abstract
Despite decades of effort, understanding the shape of genome space in biology remains a challenge due to the similarity, variability, diversity, and plasticity of evolutionary relationships among species, genes, or other biological entities. We present a k-mer topology method, the first of its kind, to delineate the shape of the genome space. K-mer topology examines the topological persistence and the evolution of the homotopic shape of the sequences of k nucleotides in species, organisms, and genes using persistent Laplacians, a new multiscale combinatorial approach. We also propose a topological genetic distance between species by their topological invariants and non-harmonic spectra over scales. This new metric defines the topological phylogenetic trees of genomes, facilitating species classification and clustering. K-mer topology substantially outperforms state-of-the-art methods on a variety of benchmark datasets, including mammalian mitochondrial genomes, Rhinovirus, SARS-CoV-2 variants, Ebola virus, Hepatitis E virus, Influenza hemagglutinin genes, and whole bacterial genomes. K-mer topology reveals the intrinsic shapes of the genome space and can be directly applied to the rational design of viral vaccines.
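To make the idea of tracking the topology of k-mer occurrences across scales concrete, here is a minimal sketch. It is a deliberately simplified stand-in, not the authors' method: it assumes each k-mer is represented by its start positions along the sequence, and it records only the number of connected components (Betti-0) of those positions at a few distance scales. Persistent Laplacians and their non-harmonic spectra are omitted entirely. The function names, the default scales, and the use of a Euclidean distance between feature vectors are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
import math


def kmer_positions(seq, k):
    """Map each k-mer to the sorted list of its start positions in seq."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def betti0_curve(points, scales):
    """Count connected components of a sorted 1D point set at each scale.

    Two consecutive points are joined when their gap is <= r, so the
    component count is 1 plus the number of gaps larger than r.
    """
    if not points:
        return [0] * len(scales)
    gaps = [b - a for a, b in zip(points, points[1:])]
    return [1 + sum(g > r for g in gaps) for r in scales]


def topological_features(seq, k, scales):
    """Concatenate the Betti-0 curves of all 4**k k-mers in a fixed order."""
    positions = kmer_positions(seq, k)
    features = []
    for kmer in map("".join, product("ACGT", repeat=k)):
        features.extend(betti0_curve(positions.get(kmer, []), scales))
    return features


def feature_distance(seq_a, seq_b, k=2, scales=(1, 2, 4, 8)):
    """Euclidean distance between feature vectors (an illustrative choice)."""
    fa = topological_features(seq_a, k, scales)
    fb = topological_features(seq_b, k, scales)
    return math.dist(fa, fb)
```

Calling `feature_distance(seq_a, seq_b)` on two nucleotide strings gives a single non-negative number that is zero for identical sequences. A matrix of such pairwise values could be passed to any standard hierarchical clustering routine to draw a tree, which is the role the abstract assigns to its topological genetic distance.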
- Publication: arXiv e-prints
- Pub Date: December 2024
- arXiv: arXiv:2412.20202
- Bibcode: 2024arXiv241220202H
- Keywords: Quantitative Biology - Genomics; Mathematics - Algebraic Topology
- E-Print: 51 pages, 16 figures