Robustness of Mixtures of Experts to Feature Noise
Dong Sun
Rahul Nittala
Rebekka Burkholz
Main: 8 pages, Bibliography: 3 pages, Appendix: 16 pages; 10 figures, 7 tables
Abstract
Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.
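To make the setup concrete, below is a minimal sketch (not code from the paper) of the kind of comparison the abstract describes: inputs with latent modular structure corrupted by additive Gaussian feature noise, fitted by a dense least-squares estimator versus a sparse per-module ("expert") estimator with oracle routing. All names, dimensions, and noise levels are illustrative assumptions.

```python
# Toy illustration (assumed setup): sparse per-module experts vs. a dense
# estimator on modular data with additive feature noise.
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 16, 4, 2000           # feature dim, number of experts/modules, samples
block = d // k                  # each latent module occupies one feature block

# Ground-truth modular data: each sample activates one block of features,
# and its target depends only on that block.
true_w = rng.normal(size=(k, block))
module = rng.integers(0, k, size=n)
x_clean = np.zeros((n, d))
for i in range(n):
    x_clean[i, module[i] * block:(module[i] + 1) * block] = rng.normal(size=block)
y = np.einsum("ij,ij->i",
              x_clean.reshape(n, k, block)[np.arange(n), module],
              true_w[module])

sigma = 0.5                      # feature-noise level (assumed)
x_noisy = x_clean + sigma * rng.normal(size=x_clean.shape)

# Dense estimator: ordinary least squares on all d features.
w_dense, *_ = np.linalg.lstsq(x_noisy, y, rcond=None)

# Sparse "MoE" estimator: route each sample to the expert for its module
# (oracle routing, for illustration only) and fit one expert per block.
w_moe = np.zeros((k, block))
for e in range(k):
    idx = module == e
    xe = x_noisy[idx, e * block:(e + 1) * block]
    w_moe[e], *_ = np.linalg.lstsq(xe, y[idx], rcond=None)

# Evaluate on fresh noisy inputs: the sparse estimator ignores noise in the
# d - d/k irrelevant features, so its error is typically lower.
x_test = x_clean + sigma * rng.normal(size=x_clean.shape)
pred_dense = x_test @ w_dense
pred_moe = np.einsum("ij,ij->i",
                     x_test.reshape(n, k, block)[np.arange(n), module],
                     w_moe[module])
print("dense MSE:", np.mean((pred_dense - y) ** 2))
print("MoE   MSE:", np.mean((pred_moe - y) ** 2))
```

Running the sketch typically shows a lower test error for the sparse estimator, mirroring the noise-filtering intuition stated in the abstract; the paper's actual analysis and experiments are more general than this toy example.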
