
Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints

Seng Pei Liew
Kenta Shinzato
Yuyang Dong
Main: 3 pages · Appendix: 4 pages · Bibliography: 3 pages · 6 figures · 8 tables
Abstract

Modern Mixture-of-Experts (MoE) language models are designed around two quantities: total parameters (memory footprint) and active parameters (inference cost). However, we find that these two factors alone are insufficient to describe an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by total parameters ($N_{total}$) and expert sparsity ($s := n_{exp}/n_{topk}$). Moreover, $n_{exp}$ and $n_{topk}$ do not "cancel out" within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to meet memory constraints. This motivates a simple design principle for MoE models: maximize $N_{total}$ while minimizing $s$ (i.e., maximizing $n_{topk}$) and minimizing $n_{exp}$ under the given constraints. Our findings provide a robust framework for resolving architectural ambiguity and guiding MoE design.
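As a rough illustration of the quantities in the abstract, the sketch below (not from the paper) computes expert sparsity $s = n_{exp}/n_{topk}$ and approximate total versus active expert parameters for two hypothetical configurations with the same memory footprint. All names, shapes, and the parameter-count formula are illustrative assumptions; attention and embedding parameters are omitted.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Illustrative MoE shape parameters (hypothetical, not from the paper)."""
    n_layers: int   # depth
    d_model: int    # hidden width
    d_ff: int       # per-expert feed-forward width
    n_exp: int      # experts per MoE layer (n_exp in the abstract)
    n_topk: int     # experts routed per token (n_topk in the abstract)

    @property
    def sparsity(self) -> float:
        # Expert sparsity as defined in the abstract: s := n_exp / n_topk
        return self.n_exp / self.n_topk

    def expert_params(self) -> tuple[int, int]:
        """Very rough expert-parameter counts (up- and down-projections only)."""
        per_expert = 2 * self.d_model * self.d_ff
        total = self.n_layers * self.n_exp * per_expert    # memory footprint
        active = self.n_layers * self.n_topk * per_expert  # per-token inference cost
        return total, active


# Two configs with equal total expert parameters (memory) but different sparsity.
# Under the abstract's principle, the lower-s config (b) is preferred when the
# active-parameter (inference) budget allows it.
a = MoEConfig(n_layers=24, d_model=2048, d_ff=1408, n_exp=64, n_topk=2)
b = MoEConfig(n_layers=24, d_model=2048, d_ff=5632, n_exp=16, n_topk=4)

for name, cfg in (("a", a), ("b", b)):
    total, active = cfg.expert_params()
    print(f"{name}: total={total / 1e9:.2f}B  active={active / 1e9:.2f}B  s={cfg.sparsity:.0f}")
```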
