Generalized Error Exponents for Sparse Sample Goodness of Fit Tests

6 April 2012

Abstract

We investigate the sparse sample goodness-of-fit problem, where the number of samples $n$ is smaller than the size of the alphabet $m$. The goal of this work is to find an appropriate criterion for analyzing statistical tests in this setting. A suitable model for analysis is the high-dimensional model in which both $n$ and $m$ tend to infinity, and $n=o(m)$. We propose a new performance criterion based on large deviation analysis, which generalizes the classical error exponent applicable to large sample problems (in which $m=O(n)$). This new criterion provides insights that are not available from asymptotic consistency or CLT analysis. The main results are: (i) The best achievable probability of error $P_e$ decays as $-\log(P_e)=(n^2/m)(1+o(1))J$ for some $J>0$. (ii) A well-known coincidence-based test attains the optimal generalized error exponent. (iii) The widely used Pearson chi-square test has $J=0$. (iv) The contributions (i)-(iii) are established under the assumption that the distribution under the null hypothesis is uniform. For the non-uniform case, a new test is proposed, with a non-zero generalized error exponent.
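
To make the two kinds of statistic named in results (ii) and (iii) concrete, the following is a minimal Python sketch for a uniform null on an alphabet of size $m$. The abstract does not specify the exact form of the coincidence-based statistic, so the two variants shown here (the number of repeated draws, and the number of symbols seen exactly once) are assumptions chosen for illustration, as are all function names. The sketch only computes the statistics; thresholds and the error-exponent analysis are not reproduced.

```python
from collections import Counter


def coincidence_count(samples):
    """Number of draws that repeat an already-seen symbol: n minus the number of distinct symbols.

    Illustrative assumption: one simple way to count coincidences.
    """
    return len(samples) - len(set(samples))


def singleton_count(samples):
    """Number of symbols that appear exactly once in the sample.

    Illustrative assumption: another statistic sometimes used in coincidence-based tests.
    """
    counts = Counter(samples)
    return sum(1 for c in counts.values() if c == 1)


def pearson_chi_square_uniform(samples, m):
    """Pearson chi-square statistic against the uniform distribution on m symbols."""
    n = len(samples)
    expected = n / m  # expected count per symbol under the uniform null
    counts = Counter(samples)
    # Symbols that were observed at least once.
    observed_part = sum((c - expected) ** 2 / expected for c in counts.values())
    # Each unobserved symbol contributes (0 - expected)^2 / expected = expected.
    unobserved_part = (m - len(counts)) * expected
    return observed_part + unobserved_part


if __name__ == "__main__":
    # A tiny sparse sample: n = 5 draws from an alphabet of size m = 10.
    samples = [0, 1, 1, 2, 5]
    print(coincidence_count(samples))
    print(singleton_count(samples))
    print(pearson_chi_square_uniform(samples, m=10))
```

In the sparse regime most symbols are never observed, so the chi-square sum is dominated by the many unobserved terms, while the coincidence counts depend only on the handful of repeated symbols.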
