We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution $p$ over $n$ elements, an explicitly given distribution $q$, and parameters $0 < \epsilon, \delta < 1$, we wish to distinguish, {\em with probability at least $1-\delta$}, whether the distributions are identical versus $\epsilon$-far in total variation distance. Most prior work focused on the case that $\delta = \Omega(1)$, for which the sample complexity of identity testing is known to be $\Theta(\sqrt{n}/\epsilon^2)$. Given such an algorithm, one can achieve arbitrarily small values of $\delta$ via black-box amplification, which multiplies the required number of samples by $\Theta(\log(1/\delta))$. We show that black-box amplification is suboptimal for any $\delta = o(1)$, and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is \[ \Theta\left( \frac{1}{\epsilon^2}\left(\sqrt{n \log(1/\delta)} + \log(1/\delta) \right)\right) \] for any $n$, $\epsilon$, and $\delta$. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $d_{\mathrm{TV}}(p, U_n) \geq \epsilon$, we simply threshold $\|\widehat{p}\|_2$, where $\widehat{p}$ is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant $\delta$ case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of $\epsilon$ and $\delta$.
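To make the plug-in uniformity tester concrete, the following Python sketch thresholds the empirical $\ell_2$ norm as described above. The abstract specifies neither the exact cutoff nor the required sample size, so the particular threshold (and the \texttt{slack} parameter) below are illustrative assumptions derived from a standard second-moment calculation, not the paper's constants.

\begin{verbatim}
import numpy as np

def uniformity_test(samples, n, eps, slack=2.0):
    """Plug-in uniformity tester (illustrative sketch).

    Accepts (returns True) when the empirical l2 norm is small,
    i.e. consistent with p = U_n; rejects when it is large.
    """
    m = len(samples)
    counts = np.bincount(np.asarray(samples), minlength=n)
    p_hat = counts / m            # empirical distribution \hat{p}
    stat = np.sum(p_hat ** 2)     # ||\hat{p}||_2^2; thresholding the
                                  # squared norm is equivalent

    # Standard moment calculation for m multinomial samples:
    #   E[||\hat{p}||_2^2] = 1/m + (1 - 1/m) * ||p||_2^2,
    # where ||p||_2^2 = 1/n if p = U_n, and ||p||_2^2 >= (1 + 4 eps^2)/n
    # whenever d_TV(p, U_n) >= eps.  Threshold at an intermediate point;
    # the factor `slack` in (0, 4) is an arbitrary illustrative choice.
    threshold = 1.0/m + (1.0 - 1.0/m) * (1.0 + slack * eps**2) / n
    return stat <= threshold
\end{verbatim}

The design point is that the statistic separates the two cases in expectation: $\|p\|_2^2 = 1/n + \|p - U_n\|_2^2$, and $d_{\mathrm{TV}}(p, U_n) \geq \epsilon$ forces $\|p - U_n\|_2^2 \geq 4\epsilon^2/n$ by Cauchy--Schwarz. The paper's contribution is showing that this simple statistic concentrates well enough that $m$ matching the optimal bound above suffices, with the stated $1-\delta$ confidence.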