Approximation Rates and VC-Dimension Bounds for (P)ReLU MLP Mixture of Experts

5 February 2024
Anastasis Kratsios
Haitz Sáez de Ocáriz Borde
Takashi Furuya
Marc T. Law
    MoE
arXiv:2402.03460
Abstract

Mixture-of-Experts (MoEs) can scale up beyond traditional deep learning models by employing a routing strategy in which each input is processed by a single "expert" deep learning model. This strategy allows us to scale up the number of parameters defining the MoE while maintaining sparse activation, i.e., MoEs only load a small number of their total parameters into GPU VRAM for the forward pass, depending on the input. In this paper, we provide an approximation- and learning-theoretic analysis of mixtures of expert MLPs with (P)ReLU activation functions. We first prove that for every error level $\varepsilon>0$ and every Lipschitz function $f:[0,1]^n\to\mathbb{R}$, one can construct a MoMLP model (a Mixture-of-Experts composed of (P)ReLU MLPs) which uniformly approximates $f$ to $\varepsilon$ accuracy over $[0,1]^n$, while only requiring networks of $\mathcal{O}(\varepsilon^{-1})$ parameters to be loaded in memory. Additionally, we show that MoMLPs can generalize, since the entire MoMLP model has a (finite) VC dimension of $\tilde{O}(L\max\{nL,JW\})$, if there are $L$ experts and each expert has depth $J$ and width $W$.
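To make the routing idea concrete, here is a minimal sketch (not the authors' construction) of a Mixture-of-Experts whose experts are small PReLU MLPs and whose router sends each input to exactly one expert, so only that expert's parameters participate in the forward pass. All names (MoMLPSketch, make_prelu_mlp, the hard argmax router, and the chosen sizes) are illustrative assumptions, not the paper's notation or code.

```python
# Sketch of a hard-routed Mixture-of-Experts with PReLU MLP experts.
# Illustrative only; hyperparameters and routing rule are assumptions.
import torch
import torch.nn as nn


def make_prelu_mlp(in_dim: int, width: int, depth: int) -> nn.Sequential:
    """A plain PReLU MLP of the given width and depth, mapping R^in_dim -> R."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.PReLU()]
        d = width
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)


class MoMLPSketch(nn.Module):
    def __init__(self, in_dim: int, n_experts: int, width: int, depth: int):
        super().__init__()
        # Linear router producing one score per expert (an assumed, simple choice).
        self.router = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            [make_prelu_mlp(in_dim, width, depth) for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route each input to a single expert (sparse activation): per input,
        # only one expert's parameters are used in the computation.
        expert_idx = self.router(x).argmax(dim=-1)  # shape: (batch,)
        out = torch.empty(x.shape[0], 1, device=x.device)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out


if __name__ == "__main__":
    model = MoMLPSketch(in_dim=3, n_experts=4, width=16, depth=2)
    y = model(torch.rand(8, 3))  # 8 points in [0,1]^3
    print(y.shape)               # torch.Size([8, 1])
```

In this toy setup, memory usage during inference scales with the size of a single expert rather than the full parameter count, which is the sparse-activation property the abstract refers to.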
