Impossibility of phylogeny reconstruction from k-mer counts

We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed k, no statistically consistent phylogeny estimation is possible from k-mer counts of the leaf sequences alone. Formally, we establish that the joint leaf distributions of k-mer counts on two distinct trees have total variation distance bounded away from 1 as the sequence length tends to infinity. That is, the two distributions cannot be distinguished with probability going to one in that asymptotic regime. Our results are information-theoretic: they imply an impossibility result for any reconstruction method using only k-mer counts at the leaves.
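
As a rough illustration of the summary statistic involved (not of the proof), the following Python sketch counts overlapping length-k substrings of a two-state sequence. The function name `kmer_counts` and the example sequences are hypothetical, chosen only to show that distinct sequences can share the same counts, since the counts discard where each substring occurs.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring of seq, using overlapping windows."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k has no windows, so the result is empty.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Two different two-state sequences (illustrative data, not from the paper).
a = "00110"
b = "01100"

counts_a = kmer_counts(a, 2)
counts_b = kmer_counts(b, 2)

# Both sequences contain each of 00, 01, 10, 11 exactly once.
print(counts_a == counts_b)
```

Here the two sequences differ, yet their 2-mer counts coincide. This only shows that the map from sequences to counts loses information; the statement above concerns the much stronger claim about joint count distributions across the leaves of two trees.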