We investigate the problem of identity testing for multidimensional histogram distributions. A distribution $p$ on a $d$-dimensional domain is called a {\em $k$-histogram} if there exists a partition of the domain into $k$ axis-aligned rectangles such that $p$ is constant within each such rectangle. Histograms are one of the most fundamental non-parametric families of distributions and have been extensively studied in computer science and statistics. We give the first identity tester for this problem with {\em sub-learning} sample complexity in any fixed dimension and a nearly-matching sample complexity lower bound. More specifically, let $q$ be an unknown $d$-dimensional $k$-histogram and $p$ be an explicitly given $k$-histogram. We want to correctly distinguish, with probability at least $2/3$, between the case that $p = q$ versus $\|p - q\|_1 \geq \epsilon$. We design a computationally efficient algorithm for this hypothesis testing problem with sample complexity . Our algorithm is robust to model misspecification, i.e., succeeds even if $q$ is only promised to be {\em close} to a $k$-histogram. Moreover, for , we show a nearly-matching sample complexity lower bound of when . Prior to our work, the sample complexity of the $d = 1$ case was well-understood, but no algorithm with sub-learning sample complexity was known, even for $d = 2$. Our new upper and lower bounds have interesting conceptual implications regarding the relation between learning and testing in this setting.
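As a sketch of the definition and the testing problem in symbols: the notation below (rectangles $R_1, \dots, R_k$, values $c_1, \dots, c_k$, a continuous domain $D$, the distance parameter $\epsilon$, and the $\ell_1$ norm as the distance) is assumed here for illustration and is not fixed by the abstract. A $k$-histogram $p$ on $D$ can be written as
\[
  p(x) \;=\; \sum_{i=1}^{k} c_i \, \mathbf{1}[x \in R_i],
  \qquad D = \bigsqcup_{i=1}^{k} R_i,
  \qquad c_i \geq 0,
  \qquad \sum_{i=1}^{k} c_i \, \mathrm{vol}(R_i) = 1,
\]
where each $R_i$ is an axis-aligned rectangle. Given samples from an unknown $q$ and an explicit description of $p$, the tester must decide between
\[
  H_0 :\; q = p
  \qquad \text{versus} \qquad
  H_1 :\; \|q - p\|_1 \geq \epsilon .
\]
For instance, with $d = 2$ and $D = [0,1]^2$, taking $R_1 = [0, 1/2) \times [0,1]$ with $c_1 = 3/2$ and $R_2 = [1/2, 1] \times [0,1]$ with $c_2 = 1/2$ gives a $2$-histogram, since $\tfrac{3}{2} \cdot \tfrac{1}{2} + \tfrac{1}{2} \cdot \tfrac{1}{2} = 1$. Its $\ell_1$ distance from the uniform distribution on $D$ is $\tfrac{1}{2} \cdot \tfrac{1}{2} + \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{2}$.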