Impossibility of phylogeny reconstruction from $k$-mer counts
Abstract
We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed $k$ no statistically consistent phylogeny estimation is possible from $k$-mer counts over the full leaf sequences alone. Formally, we establish that the joint distributions of $k$-mer counts over the entire leaf sequences on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity. Our impossibility result implies that statistical consistency requires a more sophisticated use of $k$-mer count information, such as the block techniques developed in previous theoretical work.
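To make the summary statistic concrete, the following minimal Python sketch counts overlapping $k$-mers in a two-state sequence. The function name `kmer_counts` and the example strings are illustrative assumptions, not taken from the paper. The sketch only shows that $k$-mer counts over a full sequence discard positional information; it does not reproduce the paper's total variation argument.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping length-k substrings (k-mers) of seq.

    Hypothetical helper for illustration; seq is a string over {'0', '1'}.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k across the whole sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Two different two-state sequences of the same length.
seq_a = "00101"
seq_b = "01001"

counts_a = kmer_counts(seq_a, 2)
counts_b = kmer_counts(seq_b, 2)

# The sequences differ, yet their 2-mer counts coincide.
print(seq_a != seq_b, counts_a == counts_b)
```

Here both sequences contain one `00`, two `01`, and one `10`, so the counts alone cannot distinguish them.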
- Publication:
- arXiv e-prints
- Pub Date:
- October 2020
- DOI:
- 10.48550/arXiv.2010.14460
- arXiv:
- arXiv:2010.14460
- Bibcode:
- 2020arXiv201014460F
- Keywords:
- Mathematics - Probability;
- Computer Science - Computational Engineering, Finance, and Science;
- Mathematics - Statistics Theory;
- Quantitative Biology - Populations and Evolution
- E-Print:
- 35 pages, 4 figures