
ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits

Aryan Karmore
Main: 7 pages · 6 figures · 2 tables · Bibliography: 2 pages
Abstract

Mixture-of-Experts models store $N$ independent expert weight matrices, requiring $\mathcal{O}(N \cdot d^2)$ memory that scales linearly with the number of experts and exceeds the memory budgets of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave this scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified, shared, quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned rotations to a shared ternary prototype, ButterflyMoE stores all $N$ experts in $\mathcal{O}(d^2 + N \cdot d \log d)$ memory, sub-linear in the number of experts. The key insight is that training these rotations jointly with quantization reduces activation outliers and stabilizes extreme low-bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves a 150$\times$ memory reduction at 256 experts with negligible accuracy loss. ButterflyMoE allows many experts to fit on memory-constrained edge devices, showing that geometric parameterization breaks linear scaling.
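As a rough illustration of the memory argument only, the sketch below parameterizes each expert as a butterfly-structured rotation of a single shared ternary weight matrix. This is not the authors' released implementation; the names ButterflyExpertLayer and butterfly_rotate, the straight-through ternary estimator, and the 0.5·mean(|w|) threshold are hypothetical choices made for this example.

```python
import math
import torch

def butterfly_rotate(x, angles):
    """Apply a butterfly-structured orthogonal transform to x of shape (..., d).

    angles has shape (log2(d), d // 2): each of the log2(d) stages pairs
    coordinates at a power-of-two stride and rotates every pair by its own
    Givens angle, so a full "rotation" costs O(d log d) parameters instead
    of the O(d^2) needed for a dense orthogonal matrix.
    """
    d = x.shape[-1]
    for s in range(int(math.log2(d))):
        stride = 2 ** s
        # Pair index i with i + stride inside blocks of size 2 * stride.
        x = x.reshape(*x.shape[:-1], d // (2 * stride), 2, stride)
        a, b = x[..., 0, :], x[..., 1, :]
        theta = angles[s].reshape(d // (2 * stride), stride)
        c, t = torch.cos(theta), torch.sin(theta)
        x = torch.stack((c * a - t * b, t * a + c * b), dim=-2)
        x = x.reshape(*x.shape[:-3], d)
    return x

class ButterflyExpertLayer(torch.nn.Module):
    """All experts share one ternary prototype; each adds only rotation angles.

    Memory: the prototype is O(d^2) ternary values plus a scale, while the
    per-expert angles total O(N * d log d), matching the abstract's
    O(d^2 + N * d log d) rather than O(N * d^2).
    """

    def __init__(self, d, n_experts):
        super().__init__()
        assert d & (d - 1) == 0, "d must be a power of two for the butterfly"
        self.prototype = torch.nn.Parameter(torch.randn(d, d) / math.sqrt(d))
        self.scale = torch.nn.Parameter(torch.ones(1))
        # (N, log2(d), d // 2) angles: the only per-expert storage.
        self.angles = torch.nn.Parameter(
            0.01 * torch.randn(n_experts, int(math.log2(d)), d // 2))

    def ternary_weight(self):
        # Straight-through ternary quantization: the forward pass uses
        # {-1, 0, +1} * scale, the backward pass passes gradients to the
        # latent full-precision prototype.
        w = self.prototype
        q = torch.sign(w) * (w.abs() > 0.5 * w.abs().mean()).to(w.dtype)
        return w + (self.scale * q - w).detach()

    def forward(self, x, expert_idx):
        # Rotate the input into the chosen expert's frame, then apply the
        # shared ternary weight; the dense per-expert matrix is never built.
        x_rot = butterfly_rotate(x, self.angles[expert_idx])
        return x_rot @ self.ternary_weight().t()

# Usage: 64 experts over a 256-dimensional hidden state.
layer = ButterflyExpertLayer(d=256, n_experts=64)
y = layer(torch.randn(8, 256), expert_idx=3)
```

The butterfly factorization is what keeps per-expert storage at d log d angles; a dense learned rotation per expert would bring memory back to O(N · d^2) and erase the sub-linear scaling the abstract claims.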
