High-Dimensional Robust Mean Estimation with Untrusted Batches

24 February 2026

Maryam Aliakbarpour

Vladimir Braverman

Yuhan Liu

Junze Yin

FedML


Main:61 Pages

4 Figures

Bibliography:4 Pages

Abstract

We study high-dimensional mean estimation in a collaborative setting where data is contributed by $N$ users in batches of size $n$. In this environment, a learner seeks to recover the mean $\mu$ of a true distribution $P$ from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an $\varepsilon$-fraction of users are entirely adversarial, while the remaining "good" users provide data from distributions that are related to $P$, but deviate by a proximity parameter $\alpha$.

Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of $\sqrt{\alpha}$, or (2) an $\alpha$-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily.

We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate $O(\sqrt{\varepsilon/n} + \sqrt{d/(nN)} + \sqrt{\alpha})$, demonstrating that while heterogeneity $\alpha$ represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of $1/\sqrt{n}$ due to the internal averaging afforded by the batch structure.
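To make the stated rate concrete, the following minimal Python sketch evaluates its three terms separately for given parameters. It is an illustration only, not the paper's SoS algorithm: the function name `rate_terms` and the dictionary keys are hypothetical, the constants hidden by the $O(\cdot)$ notation are assumed to be 1, and the statistical term is read as $\sqrt{d/(nN)}$.

```python
import math


def rate_terms(eps, alpha, n, N, d):
    """Evaluate the three terms of sqrt(eps/n) + sqrt(d/(n*N)) + sqrt(alpha).

    Assumption: every constant hidden by the O(.) notation is set to 1,
    so the values show scaling behaviour only, not actual error bounds.
    """
    if n < 1 or N < 1 or d < 1:
        raise ValueError("n, N and d must be positive integers")
    if not (0.0 <= eps <= 1.0 and 0.0 <= alpha <= 1.0):
        raise ValueError("eps and alpha must lie in [0, 1]")
    return {
        # contribution of the eps-fraction of adversarial users
        "adversarial": math.sqrt(eps / n),
        # sampling error from n * N total samples in d dimensions
        "statistical": math.sqrt(d / (n * N)),
        # heterogeneity among good users; does not depend on n
        "heterogeneity": math.sqrt(alpha),
    }


# Same parameters, batch size 1 versus batch size 100.
single = rate_terms(eps=0.1, alpha=0.01, n=1, N=10_000, d=100)
batched = rate_terms(eps=0.1, alpha=0.01, n=100, N=10_000, d=100)

for name in single:
    print(f"{name:>13}: n=1 -> {single[name]:.4f}, n=100 -> {batched[name]:.4f}")
```

Increasing the batch size from 1 to 100 shrinks the adversarial term by a factor of $\sqrt{100} = 10$, while the heterogeneity term stays fixed, which matches the abstract's point that $\alpha$ is an inherent difficulty whereas adversarial influence is damped by $1/\sqrt{n}$.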
