Better Sum Estimation via Weighted Sampling

28 October 2021

Abstract

Given a large set $U$ where each item $a\in U$ has weight $w(a)$ , we want to estimate the total weight $W=\sum_{a\in U} w(a)$ to within factor of $1\pm\varepsilon$ with some constant probability $>1/2$ . Since $n=|U|$ is large, we want to do this without looking at the entire set $U$ . In the traditional setting in which we are allowed to sample elements from $U$ uniformly, sampling $\Omega(n)$ items is necessary to provide any non-trivial guarantee on the estimate. Therefore, we investigate this problem in different settings: in the \emph{proportional} setting we can sample items with probabilities proportional to their weights, and in the \emph{hybrid} setting we can sample both proportionally and uniformly. These settings have applications, for example, in sublinear-time algorithms and distribution testing. Sum estimation in the proportional and hybrid setting has been considered before by Motwani, Panigrahy, and Xu [ICALP, 2007]. In their paper, they give both upper and lower bounds in terms of $n$ . Their bounds are near-matching in terms of $n$ , but not in terms of $\varepsilon$ . In this paper, we improve both their upper and lower bounds. Our bounds are matching up to constant factors in both settings, in terms of both $n$ and $\varepsilon$ . No lower bounds with dependency on $\varepsilon$ were known previously. In the proportional setting, we improve their $\tilde{O}(\sqrt{n}/\varepsilon^{7/2})$ algorithm to $O(\sqrt{n}/\varepsilon)$ . In the hybrid setting, we improve $\tilde{O}(\sqrt[3]{n}/ \varepsilon^{9/2})$ to $O(\sqrt[3]{n}/\varepsilon^{4/3})$ . Our algorithms are also significantly simpler and do not have large constant factors. We also investigate the previously unexplored setting where $n$ is unknown to the algorithm. Finally, we show how our techniques apply to the problem of edge counting in graphs.

View on arXiv

Comments on this paper