Testing Closeness With Unequal Sized Samples

Abstract

We consider the problem of closeness testing for two discrete distributions in the practically relevant setting of \emph{unequal} sized samples drawn from each of them. Specifically, given a target error parameter $\varepsilon > 0$, $m_1$ independent draws from an unknown distribution $p$, and $m_2$ draws from an unknown distribution $q$, we describe a test for distinguishing the case that $p = q$ from the case that $\|p-q\|_1 \geq \varepsilon$. If $p$ and $q$ are supported on at most $n$ elements, then our test succeeds with high probability provided $m_1 \geq n^{2/3}/\varepsilon^{4/3}$ and $m_2 = \Omega(\max\{\frac{n}{\sqrt{m_1}\,\varepsilon^2}, \frac{\sqrt{n}}{\varepsilon^2}\})$; we show that this tradeoff is optimal throughout this range, up to constant factors. These results extend the recent work of Chan et al., who established the sample complexity when the two samples have equal sizes, and tighten the results of Acharya et al. by polynomial factors in both $n$ and $\varepsilon$. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states up to a $\log n$ factor that uses $\tilde{O}(n^{3/2} \tau_{mix})$ queries to a "next node" oracle, improving upon the $\tilde{O}(n^{5/3}\tau_{mix})$ query algorithm of Batu et al. Finally, we note that the core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural-language data.
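To illustrate the flavor of the "relatively simple statistic" mentioned above, the following is a minimal sketch of a chi-squared-style closeness statistic in the spirit of Chan et al., generalized to unequal sample sizes $m_1 \neq m_2$. This is an illustrative sketch only: the exact statistic, its normalization, and the acceptance threshold used in the paper may differ, and the threshold chosen below is an assumption for demonstration, not the paper's analysis.

```python
import numpy as np


def closeness_statistic(x, y, m1, m2):
    """Chi-squared-style closeness statistic for count vectors.

    x, y : arrays of per-symbol counts from samples of sizes m1, m2.
    The statistic has (approximately) zero mean when p = q and grows
    large when ||p - q||_1 is bounded away from zero.

    NOTE: sketch in the spirit of Chan et al.; the paper's exact
    statistic may differ.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Rescale counts so both samples are on a common footing, and
    # subtract the first-order terms to debias the squared difference.
    num = (m2 * x - m1 * y) ** 2 - m2**2 * x - m1**2 * y
    den = x + y
    mask = den > 0  # skip symbols unseen in both samples
    return float(np.sum(num[mask] / den[mask]))


# Demonstration on synthetic data (multinomial counts stand in for
# the Poissonized sampling usually assumed in the analysis).
rng = np.random.default_rng(0)
n, m1, m2 = 100, 5000, 5000
p = np.full(n, 1.0 / n)                      # uniform distribution
q = np.concatenate([np.zeros(n // 2),        # far from p in L1 distance
                    np.full(n // 2, 2.0 / n)])

x = rng.multinomial(m1, p)
stat_same = closeness_statistic(x, rng.multinomial(m2, p), m1, m2)
stat_far = closeness_statistic(x, rng.multinomial(m2, q), m1, m2)
# stat_far is orders of magnitude larger than |stat_same|
```

A tester would compare the statistic against a threshold calibrated from $n$, $m_1$, $m_2$, and $\varepsilon$; deriving that threshold (and the matching lower bound) is the technical content of the paper.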
