
Testing Identity of Multidimensional Histograms

Abstract

We investigate the problem of identity testing for multidimensional histogram distributions. A distribution $p: D \rightarrow \mathbb{R}_+$, where $D \subseteq \mathbb{R}^d$, is called a $k$-histogram if there exists a partition of the domain into $k$ axis-aligned rectangles such that $p$ is constant within each such rectangle. Histograms are one of the most fundamental nonparametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with sub-learning sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. In more detail, let $q$ be an unknown $d$-dimensional $k$-histogram distribution in fixed dimension $d$, and $p$ be an explicitly given $d$-dimensional $k$-histogram. We want to correctly distinguish, with probability at least $2/3$, between the case that $p = q$ versus $\|p-q\|_1 \geq \epsilon$. We design an algorithm for this hypothesis testing problem with sample complexity $O((\sqrt{k}/\epsilon^2) 2^{d/2} \log^{2.5 d}(k/\epsilon))$ that runs in sample-polynomial time. Our algorithm is robust to model misspecification, i.e., succeeds even if $q$ is only promised to be close to a $k$-histogram. Moreover, for $k = 2^{\Omega(d)}$, we show a sample complexity lower bound of $(\sqrt{k}/\epsilon^2) \cdot \Omega(\log(k)/d)^{d-1}$ when $d \geq 2$. That is, for any fixed dimension $d$, our upper and lower bounds are nearly matching. Prior to our work, the sample complexity of the $d=1$ case was well-understood, but no algorithm with sub-learning sample complexity was known, even for $d=2$. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.
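To make the definitions concrete, here is a minimal Python sketch of a $k$-histogram stored as a list of axis-aligned rectangles with constant densities, together with a helper that evaluates the shape of the stated upper bound. This is an illustration only, not the paper's tester: the names `Rect`, `total_mass`, `density_at`, and `upper_bound_shape` are hypothetical, the hidden constant in the $O(\cdot)$ is dropped, and the natural logarithm is an assumption (the abstract does not fix a base).

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Rect:
    """One axis-aligned rectangle [lows, highs) carrying a constant density."""
    lows: Tuple[float, ...]
    highs: Tuple[float, ...]
    density: float

    def volume(self) -> float:
        return math.prod(h - l for l, h in zip(self.lows, self.highs))

    def contains(self, x: Sequence[float]) -> bool:
        return all(l <= xi < h for l, xi, h in zip(self.lows, x, self.highs))


def total_mass(hist: List[Rect]) -> float:
    """Sum of density * volume over all pieces; 1.0 for a valid distribution."""
    return sum(r.density * r.volume() for r in hist)


def density_at(hist: List[Rect], x: Sequence[float]) -> float:
    """Density of the histogram at point x (0.0 outside every rectangle)."""
    for r in hist:
        if r.contains(x):
            return r.density
    return 0.0


def upper_bound_shape(k: int, eps: float, d: int) -> float:
    """(sqrt(k)/eps^2) * 2^(d/2) * log^(2.5 d)(k/eps), with the constant omitted.

    Assumption: natural logarithm. Useful only for comparing growth rates.
    """
    return (math.sqrt(k) / eps**2) * 2 ** (d / 2) * math.log(k / eps) ** (2.5 * d)


# A 2-dimensional 3-histogram on the unit square [0, 1)^2.
hist = [
    Rect((0.0, 0.0), (0.5, 1.0), 1.0),  # left half, mass 0.5
    Rect((0.5, 0.0), (1.0, 0.5), 1.5),  # lower-right quarter, mass 0.375
    Rect((0.5, 0.5), (1.0, 1.0), 0.5),  # upper-right quarter, mass 0.125
]
```

Calling `total_mass(hist)` checks that the three pieces integrate to one, and `upper_bound_shape(k, eps, d)` shows how the bound scales with $\sqrt{k}$ rather than linearly in $k$, up to the logarithmic factor.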

