ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
- MoE
Mixture-of-Experts models store an independent weight matrix for every expert, so memory grows linearly with the number of experts and quickly exceeds the memory budgets of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave this scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified, shared quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned butterfly rotations to a shared ternary prototype, each expert adds only a small set of rotation parameters, yielding memory that scales sub-linearly in the number of experts. The key insight: training these rotations jointly with quantization reduces activation outliers and stabilizes extreme low-bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves a 150× memory reduction at 256 experts with negligible accuracy loss. ButterflyMoE allows many experts to fit on memory-constrained edge devices, showing that geometric parameterization breaks linear scaling.
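To make the idea concrete, here is a minimal sketch of deriving expert weights from one shared ternary prototype via learned butterfly rotations. This is not the paper's implementation: the names (`butterfly_rotation`, `W_ternary`, `expert_angles`), the toy sizes, and the single per-matrix scale are illustrative assumptions; only the overall scheme (shared ternary substrate, per-expert sparse rotation factors, experts materialized on the fly) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_experts = 16, 4  # toy dimensions (hypothetical); d must be a power of two
n_levels = int(np.log2(d))

# Shared ternary prototype in {-1, 0, +1}, plus one scale (a common ternary-quantization choice).
W_ternary = rng.integers(-1, 2, size=(d, d)).astype(np.float32)
scale = 0.02

# Per-expert rotation angles: log2(d) butterfly levels, d/2 angles per level.
# In the real method these would be learned jointly with quantization; here they are random.
expert_angles = rng.normal(0.0, 0.1, size=(n_experts, n_levels, d // 2)).astype(np.float32)

def butterfly_rotation(angles: np.ndarray, d: int) -> np.ndarray:
    """Compose log2(d) sparse butterfly factors (Givens rotations on index pairs
    that differ in one bit) into a dense d x d orthogonal matrix."""
    R = np.eye(d, dtype=np.float32)
    for level, theta in enumerate(angles):
        stride = 1 << level
        B = np.zeros((d, d), dtype=np.float32)
        k = 0
        for start in range(0, d, 2 * stride):
            for offset in range(stride):
                i, j = start + offset, start + offset + stride
                c, s = np.cos(theta[k]), np.sin(theta[k])
                B[i, i], B[i, j] = c, -s
                B[j, i], B[j, j] = s, c
                k += 1
        R = B @ R
    return R

def expert_weight(e: int) -> np.ndarray:
    """Materialize expert e on the fly: rotate the shared ternary prototype."""
    R = butterfly_rotation(expert_angles[e], d)
    return scale * (R @ W_ternary)

x = rng.normal(size=(d,)).astype(np.float32)
y = expert_weight(2) @ x  # forward pass through one (toy) expert
```

In this toy accounting, the stored parameters are one d×d ternary matrix (about 2 bits per entry) shared by all experts, plus log2(d)·d/2 angles per expert, instead of a full d×d float matrix per expert; that per-expert cost grows as d·log d rather than d², which is the source of the memory savings the abstract describes.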
View on arXiv