Near-Optimal Bounds for Testing Histogram Distributions

Abstract

We investigate the problem of testing whether a discrete probability distribution over an ordered domain is a histogram on a specified number of bins. One of the most common tools for the succinct approximation of data, $k$-histograms over $[n]$, are probability distributions that are piecewise constant over a set of $k$ intervals. The histogram testing problem is the following: Given samples from an unknown distribution $\mathbf{p}$ on $[n]$, we want to distinguish between the cases that $\mathbf{p}$ is a $k$-histogram versus $\varepsilon$-far from any $k$-histogram, in total variation distance. Our main result is a sample near-optimal and computationally efficient algorithm for this testing problem, and a nearly-matching (within logarithmic factors) sample complexity lower bound. Specifically, we show that the histogram testing problem has sample complexity $\widetilde{\Theta}(\sqrt{nk}/\varepsilon + k/\varepsilon^2 + \sqrt{n}/\varepsilon^2)$.
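To make the definitions concrete, the following minimal Python sketch checks whether a fully known probability vector is a k-histogram and evaluates the three terms of the stated bound. It is an illustration only, not the paper's testing algorithm, which works from samples of an unknown distribution. The function names, the tolerance parameter, and the reading of "k-histogram" as "piecewise constant on at most k intervals" are assumptions made here; the bound is computed without the constants and logarithmic factors hidden by the tilde notation.

```python
import math
from typing import Sequence


def num_pieces(p: Sequence[float], tol: float = 1e-12) -> int:
    """Count the maximal runs of equal consecutive values in p.

    This is the smallest number of intervals on which p is piecewise
    constant. `tol` is a hypothetical tolerance for float comparison.
    """
    if not p:
        return 0
    pieces = 1
    for prev, cur in zip(p, p[1:]):
        if abs(cur - prev) > tol:
            pieces += 1
    return pieces


def is_k_histogram(p: Sequence[float], k: int, tol: float = 1e-12) -> bool:
    """Assumed reading: p is a k-histogram if it needs at most k intervals."""
    return num_pieces(p, tol) <= k


def sample_complexity_scale(n: int, k: int, eps: float) -> float:
    """Evaluate sqrt(n*k)/eps + k/eps^2 + sqrt(n)/eps^2.

    Constants and logarithmic factors are ignored, so this gives only
    the scaling of the bound, not an actual sample count.
    """
    return math.sqrt(n * k) / eps + k / eps**2 + math.sqrt(n) / eps**2


if __name__ == "__main__":
    p = [0.1, 0.1, 0.3, 0.3, 0.2]  # constant on three intervals of [5]
    print(num_pieces(p))
    print(is_k_histogram(p, 3), is_k_histogram(p, 2))
    print(sample_complexity_scale(n=100, k=4, eps=0.5))
```

In this example the vector is constant on three intervals, so it is a 3-histogram but not a 2-histogram. Varying n, k, and ε in the last function shows which of the three terms dominates in a given parameter regime.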
