
Impossibility of phylogeny reconstruction from k-mer counts

Abstract

We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed k no statistically consistent phylogeny estimation is possible from k-mer counts of the leaf sequences alone. Formally, we establish that the joint leaf distributions of k-mer counts on two distinct trees have total variation distance bounded away from 1 as the sequence length tends to infinity. That is, the two distributions cannot be distinguished with probability going to one in that asymptotic regime. Our results are information-theoretic: they imply an impossibility result for any reconstruction method using only k-mer counts at the leaves.
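To make the object of study concrete, the following minimal sketch computes k-mer counts of a two-state sequence. It assumes a k-mer means a contiguous length-k window and that windows overlap; the abstract does not spell out this convention, and the function name `kmer_counts` and the example strings are purely illustrative.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every contiguous, overlapping length-k window of seq.

    Assumption: this windowing convention is chosen for illustration;
    the paper may define k-mer counts differently.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k has no windows, so the result is empty.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Two different two-state sequences of the same length.
a = "00110"
b = "01100"

# Their 2-mer counts coincide even though the sequences differ,
# so the counts alone do not record where each window occurred.
print(kmer_counts(a, 2) == kmer_counts(b, 2))
```

The example only illustrates that k-mer counts discard positional information; it is not a demonstration of the impossibility result itself, which concerns the joint distribution of the counts across all leaves of the tree.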
