Learning Discrete Distributions from Untrusted Batches

Abstract

We consider the problem of learning a discrete distribution in the presence of an $\epsilon$ fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, $p$, and each data source provides a batch of $\ge k$ samples, with the guarantee that at least a $(1-\epsilon)$ fraction of the sources draw their samples from a distribution with total variation distance at most $\eta$ from $p$. We make no assumptions on the data provided by the remaining $\epsilon$ fraction of sources; this data can even be chosen as an adversarial function of the $(1-\epsilon)$ fraction of "good" batches. We provide two algorithms: one with runtime exponential in the support size, $n$, but polynomial in $k$, $1/\epsilon$, and $1/\eta$, that takes $O((n+k)/\epsilon^2)$ batches and recovers $p$ to error $O(\eta + \epsilon/\sqrt{k})$. This recovery accuracy is information-theoretically optimal, to constant factors, even given an infinite number of data sources. Our second algorithm applies to the $\eta = 0$ setting and also achieves an $O(\epsilon/\sqrt{k})$ recovery guarantee, though it runs in $\mathrm{poly}((nk)^k)$ time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing $\ell_1$ distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.
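To make the data model concrete, here is a minimal simulation sketch in Python; it is illustrative only and not from the paper (all names, the choice of $\eta$-perturbation, and the point-mass corruption are assumptions). It generates batches in which good sources sample from a distribution within total variation distance $\eta$ of $p$ and an $\epsilon$ fraction of sources is adversarial, then shows that the naive pooled empirical estimate can suffer error on the order of $\epsilon$, which the paper's algorithms improve to $O(\eta + \epsilon/\sqrt{k})$.

```python
import numpy as np

def sample_untrusted_batches(p, num_batches, k, eps, eta, rng):
    """Simulate the untrusted-batches data model described in the abstract.

    A (1 - eps) fraction of "good" sources each draw k samples from some
    distribution within total variation distance eta of p; the remaining
    eps fraction is adversarial (here: a point mass, one simple choice).
    """
    n = len(p)
    batches = []
    num_bad = int(eps * num_batches)
    for i in range(num_batches):
        if i < num_batches - num_bad:
            # Good source: perturb p by exactly min(eta, p[0]) in TV distance.
            q = p.copy()
            shift = min(eta, q[0])
            q[0] -= shift
            q[1] += shift
            batches.append(rng.choice(n, size=k, p=q))
        else:
            # Adversarial source: arbitrary data; here a point mass on symbol 0.
            batches.append(np.zeros(k, dtype=int))
    return batches

rng = np.random.default_rng(0)
n, k, eps, eta = 10, 100, 0.1, 0.01
p = np.ones(n) / n  # uniform underlying distribution, for illustration
batches = sample_untrusted_batches(p, num_batches=500, k=k,
                                   eps=eps, eta=eta, rng=rng)

# The naive estimate pools all samples, adversarial ones included; its TV
# error scales like eps, whereas the paper's algorithms achieve
# O(eta + eps / sqrt(k)).
pooled = np.concatenate(batches)
p_hat = np.bincount(pooled, minlength=n) / len(pooled)
tv_error = 0.5 * np.abs(p_hat - p).sum()
print(f"TV error of naive pooled estimate: {tv_error:.4f}")
```

The point-mass corruption above is just one adversarial strategy; the paper's guarantee holds even when the bad batches are chosen adversarially as a function of the good ones.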
