Improving Pearson's chi-squared test: hypothesis testing of distributions -- optimally

Pearson's chi-squared test, from 1900, is the standard statistical tool for "hypothesis testing on distributions": namely, given samples from an unknown distribution $p$ that may or may not equal a hypothesis distribution $q$, we want to return "yes" if $p = q$ and "no" if $p$ is far from $q$. While the chi-squared test is easy to use, it has been known for a while that it is not "data efficient": it does not make the best use of its data. Precisely, for accuracy $\epsilon$ and confidence $\delta$, and given $n$ samples from the unknown distribution $p$, a tester should return "yes" with probability $>1-\delta$ when $p = q$, and "no" with probability $>1-\delta$ when $p$ is $\epsilon$-far from $q$. The challenge is to find a tester with the \emph{best} tradeoff between $n$, $\epsilon$, and $\delta$. We introduce a new tester, efficiently computable and easy to use, which we hope will replace the chi-squared tester in practical use. Our tester is found via a new non-convex optimization framework that essentially seeks to "find the tester whose Chernoff bounds on its performance are as good as possible". This tester is optimal, in that the number of samples $n$ needed by the tester is within a $1+o(1)$ factor of the samples needed by \emph{any} tester, even non-linear testers (for the setting: accuracy $\epsilon$, confidence $\delta$, and hypothesis $q$). We complement this algorithmic framework with matching lower bounds saying, essentially, that "our tester is instance-optimal, even to $1+o(1)$ factors, to the degree that Chernoff bounds are tight". Our overall non-convex optimization framework extends well beyond the current problem and is of independent interest.
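For context, the baseline being improved on can be sketched in a few lines: a minimal Pearson chi-squared tester for samples against a hypothesis distribution $q$. The threshold below (7.815, the standard 95th-percentile critical value for a chi-squared distribution with 3 degrees of freedom) and the example distributions are illustrative assumptions, not from the paper; the paper's own tester is a different, optimized statistic.

```python
import random

def pearson_chi_squared(samples, q):
    """Pearson chi-squared statistic: sum over outcomes i of
    (N_i - n*q_i)^2 / (n*q_i), where N_i is the observed count."""
    n = len(samples)
    counts = [0] * len(q)
    for s in samples:
        counts[s] += 1
    return sum((counts[i] - n * q[i]) ** 2 / (n * q[i]) for i in range(len(q)))

def chi_squared_test(samples, q, threshold):
    """Return 'yes' (consistent with q) if the statistic is below threshold."""
    return "yes" if pearson_chi_squared(samples, q) <= threshold else "no"

rng = random.Random(0)
q = [0.25, 0.25, 0.25, 0.25]          # hypothesis distribution (illustrative)

# Samples drawn from q itself: the statistic should be small.
same = [rng.choices(range(4), weights=q)[0] for _ in range(2000)]

# Samples drawn from a distribution far from q: the statistic blows up
# (its expectation grows linearly in the number of samples n).
p_far = [0.55, 0.15, 0.15, 0.15]
far = [rng.choices(range(4), weights=p_far)[0] for _ in range(2000)]

THRESHOLD = 7.815  # 95th percentile of chi-squared with 3 degrees of freedom
```

The sample-efficiency question in the abstract is precisely how large `n` must be, as a function of $\epsilon$, $\delta$, and $q$, before such a test separates the two cases reliably.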