High-Dimensional Robust Mean Estimation with Untrusted Batches
- FedML
We study high-dimensional mean estimation in a collaborative setting where data is contributed by users in fixed-size batches. In this environment, a learner seeks to recover the mean of a true distribution from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: a fraction of users are entirely adversarial, while the remaining "good" users provide data from distributions that are related to the true distribution but deviate from it, as measured by a proximity parameter. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants of deviation: (1) good batches are drawn from distributions with a shifted mean, or (2) a fraction of the samples within each good batch is adversarially corrupted. The second model in particular presents significant new challenges: in high dimensions, unlike in discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate, demonstrating that while heterogeneity represents an inherent statistical difficulty, the influence of adversarial users is suppressed by the internal averaging afforded by the batch structure.
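The two-tier corruption setting described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's SoS algorithm; it swaps in a simple coordinate-wise median of batch means as a stand-in robust estimator, and every name and value in it (`make_batches`, `n_users`, `batch_size`, `dim`, `eps`, `alpha`, the Gaussian data, the fixed corruption values) is an assumption chosen for illustration only.

```python
import numpy as np


def make_batches(n_users=200, batch_size=50, dim=20, eps=0.1, alpha=0.05, seed=0):
    """Simulate user batches under two tiers of corruption (illustrative only)."""
    rng = np.random.default_rng(seed)
    true_mean = np.zeros(dim)

    # Good users: i.i.d. Gaussian samples centred on the true mean.
    batches = rng.normal(loc=true_mean, scale=1.0, size=(n_users, batch_size, dim))

    # Sample-level corruption: an alpha-fraction of samples in every batch
    # is overwritten with a fixed outlier value (hypothetical adversary).
    n_bad_samples = int(round(alpha * batch_size))
    batches[:, :n_bad_samples, :] = 5.0

    # User-level corruption: an eps-fraction of users submit batches that
    # are entirely adversarial.
    n_bad_users = int(round(eps * n_users))
    batches[:n_bad_users] = 10.0

    return batches, true_mean


def naive_estimate(batches):
    """Average every sample from every user, with no robustness."""
    return batches.mean(axis=(0, 1))


def median_of_batch_means(batches):
    """Average within each batch, then take a coordinate-wise median across users."""
    batch_means = batches.mean(axis=1)
    return np.median(batch_means, axis=0)


if __name__ == "__main__":
    batches, true_mean = make_batches()
    print("naive error: ", np.linalg.norm(naive_estimate(batches) - true_mean))
    print("median error:", np.linalg.norm(median_of_batch_means(batches) - true_mean))
```

In this toy setup the median of batch means discards most of the influence of the fully adversarial users, but a residual bias from the sample-level corruption inside the good batches remains, which loosely mirrors the abstract's distinction between the two sources of error. A coordinate-wise median is not expected to match the guarantees claimed for the SoS-based algorithms.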