On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large, Sparse Networks

Our work in this paper is motivated by an elementary but also fundamental and highly practical observation -- that uncertainty in constructing a network graph , as an approximation (or estimate) of some true graph , manifests as errors in the status of (non)edges that must necessarily propagate to any summaries we seek. Mimicking the common practice of using plug-in estimates as proxies for , our goal is to characterize the distribution of the discrepency , in the specific case where is a subgraph count. In the empirically relevant setting of large, sparse graphs with low-rate measurement errors, we demonstrate under an independent and unbiased error model and for the specific case of counting edges that a Poisson-like regime maintains. Specifically, we show that the appropriate limiting distribution is a Skellam distribution, rather than a normal distribution. Next, because dependent errors typically can be expected when counting subgraphs in practice, either at the level of the edges themselves or due to overlap among subgraphs, we develop a parallel formalism for using the Skellam distribution in such cases. In particular, using Stein's method, we present a series of results leading to the quantification of the accuracy with which the difference of two sums of dependent Bernoulli random variables may be approximated by a Skellam. This formulation is general and likely of some independent interest. We then illustrate the use of these results in our original context of subgraph counts, where we examine (i) the case of counting edges, under a simple dependent error model, and (ii) the case of counting chains of length under an independent error model. We finish with a discussion of various open problems raised by our work.
View on arXiv