
High-Dimensional Robust Mean Estimation with Untrusted Batches

Maryam Aliakbarpour
Vladimir Braverman
Yuhan Liu
Junze Yin
Main:61 Pages
4 Figures
Bibliography:4 Pages
Abstract

We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$. In this environment, a learner seeks to recover the mean $\mu$ of a true distribution $P$ from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an $\varepsilon$-fraction of users are entirely adversarial, while the remaining ``good'' users provide data from distributions that are related to $P$, but deviate by a proximity parameter $\alpha$. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of $\sqrt{\alpha}$, or (2) an $\alpha$-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate $O(\sqrt{\varepsilon/n} + \sqrt{d/nN} + \sqrt{\alpha})$, demonstrating that while heterogeneity $\alpha$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.
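To make the stated rate concrete, the following minimal sketch evaluates its three terms separately. It is only an illustration of the formula in the abstract, not the paper's algorithm: the function names are hypothetical, the constant hidden by the $O(\cdot)$ is assumed to be 1, and $d/nN$ is read as $d/(nN)$, with $d$ taken to be the dimension.

```python
import math


def error_rate_terms(eps, alpha, n, N, d):
    """Return the three terms of the stated rate (hidden constant assumed to be 1).

    eps   : fraction of adversarial users
    alpha : proximity (heterogeneity) parameter of the good users
    n     : batch size per user
    N     : number of users
    d     : dimension (assumed meaning of d in the abstract)
    """
    adversarial = math.sqrt(eps / n)       # shrinks as the batch size n grows
    sampling = math.sqrt(d / (n * N))      # assumes d/nN means d/(nN)
    heterogeneity = math.sqrt(alpha)       # does not depend on n or N
    return adversarial, sampling, heterogeneity


def error_rate(eps, alpha, n, N, d):
    """Sum of the three terms, again with the hidden constant assumed to be 1."""
    return sum(error_rate_terms(eps, alpha, n, N, d))


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    for n in (1, 100, 10_000):
        adv, samp, het = error_rate_terms(eps=0.1, alpha=0.01, n=n, N=1000, d=50)
        print(f"n={n}: adversarial={adv:.4f} sampling={samp:.4f} heterogeneity={het:.4f}")
```

Under these assumptions, increasing the batch size $n$ shrinks the adversarial and sampling terms while the $\sqrt{\alpha}$ term stays fixed, which matches the abstract's point that heterogeneity is the part of the error that batching cannot average away.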
