
ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits

Aryan Karmore
Main: 7 pages · 6 figures · 2 tables · Bibliography: 2 pages
Abstract

Mixture-of-Experts models store $N$ independent expert weight matrices, requiring $\mathcal{O}(N \cdot d^2)$ memory that scales linearly with the number of experts and exceeds the memory budgets of edge devices. Current compression methods such as quantization, pruning, and low-rank factorization reduce constant factors but leave this scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified, shared, quantized substrate. Diversity among experts arises from viewing shared capacity from different angles, not from redundant storage. By applying learned rotations to a shared ternary prototype, ButterflyMoE stores all $N$ experts in $\mathcal{O}(d^2 + N \cdot d \log d)$ memory, sub-linear in the number of experts. The key insight is that training these rotations jointly with quantization reduces activation outliers and stabilizes extreme low-bit training, where static methods collapse. Across language modeling benchmarks, ButterflyMoE achieves a 150$\times$ memory reduction at 256 experts with negligible accuracy loss. ButterflyMoE allows many experts to fit on memory-constrained edge devices, showing that geometric parameterization breaks linear scaling.
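As a rough illustration of the memory argument only, the sketch below parameterizes each expert as a butterfly-structured rotation of a single shared ternary weight matrix. This is not the authors' released implementation; the names ButterflyExpertLayer and butterfly_rotate, the straight-through ternary estimator, and the 0.5·mean(|w|) threshold are hypothetical choices made for this example.

```python
import math
import torch

def butterfly_rotate(x, angles):
    """Apply a butterfly-structured orthogonal transform to x of shape (..., d).

    angles has shape (log2(d), d // 2): each of the log2(d) stages pairs
    coordinates at a power-of-two stride and rotates every pair by its own
    Givens angle, so a full "rotation" costs O(d log d) parameters instead
    of the O(d^2) needed for a dense orthogonal matrix.
    """
    d = x.shape[-1]
    for s in range(int(math.log2(d))):
        stride = 2 ** s
        # Pair index i with i + stride inside blocks of size 2 * stride.
        x = x.reshape(*x.shape[:-1], d // (2 * stride), 2, stride)
        a, b = x[..., 0, :], x[..., 1, :]
        theta = angles[s].reshape(d // (2 * stride), stride)
        c, t = torch.cos(theta), torch.sin(theta)
        x = torch.stack((c * a - t * b, t * a + c * b), dim=-2)
        x = x.reshape(*x.shape[:-3], d)
    return x

class ButterflyExpertLayer(torch.nn.Module):
    """All experts share one ternary prototype; each adds only rotation angles.

    Memory: the prototype is O(d^2) ternary values plus a scale, while the
    per-expert angles total O(N * d log d), matching the abstract's
    O(d^2 + N * d log d) rather than O(N * d^2).
    """

    def __init__(self, d, n_experts):
        super().__init__()
        assert d & (d - 1) == 0, "d must be a power of two for the butterfly"
        self.prototype = torch.nn.Parameter(torch.randn(d, d) / math.sqrt(d))
        self.scale = torch.nn.Parameter(torch.ones(1))
        # (N, log2(d), d // 2) angles: the only per-expert storage.
        self.angles = torch.nn.Parameter(
            0.01 * torch.randn(n_experts, int(math.log2(d)), d // 2))

    def ternary_weight(self):
        # Straight-through ternary quantization: the forward pass uses
        # {-1, 0, +1} * scale, the backward pass passes gradients to the
        # latent full-precision prototype.
        w = self.prototype
        q = torch.sign(w) * (w.abs() > 0.5 * w.abs().mean()).to(w.dtype)
        return w + (self.scale * q - w).detach()

    def forward(self, x, expert_idx):
        # Rotate the input into the chosen expert's frame, then apply the
        # shared ternary weight; the dense per-expert matrix is never built.
        x_rot = butterfly_rotate(x, self.angles[expert_idx])
        return x_rot @ self.ternary_weight().t()

# Usage: 64 experts over a 256-dimensional hidden state.
layer = ButterflyExpertLayer(d=256, n_experts=64)
y = layer(torch.randn(8, 256), expert_idx=3)
```

The butterfly factorization is what keeps per-expert storage at d log d angles; a dense learned rotation per expert would bring memory back to O(N · d^2) and erase the sub-linear scaling the abstract claims.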
