LittleBit: Ultra Low-Bit Quantization via Latent Factorization

Deploying large language models (LLMs) is often hampered by substantial memory and computational costs. Quantization offers a solution, yet avoiding performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels such as 0.1 bits per weight (BPW), achieving a nearly 31× memory reduction, e.g., compressing Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization and subsequently binarizes these factors. To counteract the information loss at this extreme precision, it integrates a multi-scale compensation mechanism with learned scales along the rows, the columns, and an additional latent dimension that captures per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for stable quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: for example, its 0.1 BPW performance on Llama2-7B surpasses the leading method's performance at 0.7 BPW. This establishes a superior size-performance trade-off, with kernel-level benchmarks indicating potential for a 5× speedup over FP16. LittleBit paves the way for deploying powerful LLMs in resource-constrained environments.
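To make the abstract's description concrete, the sketch below shows one plausible reading of the weight representation: a low-rank factorization whose factors are stored as signs, compensated by learned scales along the rows, columns, and latent (rank) dimension, plus the back-of-the-envelope storage cost per weight. This is a minimal illustration under stated assumptions, not the paper's implementation: the names (reconstruct_weight, bits_per_weight, U, V, s_row, s_col, s_rank), the rank value, the FP16 scale overhead, and the exact composition order are assumptions inferred from the text above.

import numpy as np

def reconstruct_weight(U, V, s_row, s_col, s_rank):
    """Rebuild a (d_out x d_in) weight from binarized low-rank factors.

    U: (d_out, r) latent factor; only its sign is stored (1 bit per entry)
    V: (d_in, r)  latent factor; only its sign is stored (1 bit per entry)
    s_row:  (d_out,) per-row compensation scale
    s_col:  (d_in,)  per-column compensation scale
    s_rank: (r,)     per-rank importance scale on the latent dimension
    """
    B_u = np.where(U >= 0, 1.0, -1.0)   # binarized latent factor
    B_v = np.where(V >= 0, 1.0, -1.0)
    low_rank = (B_u * s_rank) @ B_v.T   # (d_out, d_in), rank <= r
    return s_row[:, None] * low_rank * s_col[None, :]

def bits_per_weight(d_out, d_in, r, scale_bits=16):
    """Storage cost of the representation above, in bits per original weight:
    two sign matrices plus three scale vectors kept in higher precision."""
    binary_bits = r * (d_out + d_in)
    scale_bits_total = scale_bits * (d_out + d_in + r)
    return (binary_bits + scale_bits_total) / (d_out * d_in)

# Illustrative arithmetic: a 4096 x 4096 projection with rank r = 188
# lands at roughly 0.1 BPW under these assumptions.
print(bits_per_weight(4096, 4096, r=188))  # ~0.10

# Tiny usage check of the reconstruction shape.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8
W_hat = reconstruct_weight(
    rng.standard_normal((d_out, r)), rng.standard_normal((d_in, r)),
    s_row=np.ones(d_out), s_col=np.ones(d_in), s_rank=np.ones(r))
print(W_hat.shape)  # (64, 64)

Under this accounting, the binary factors dominate the footprint and the scale vectors add only a small fixed overhead, which is what makes sub-1-bit averages such as 0.1 BPW reachable.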
@article{lee2025_2506.13771,
  title={LittleBit: Ultra Low-Bit Quantization via Latent Factorization},
  author={Banseok Lee and Dongkyu Kim and Youngcheon You and Youngmin Kim},
  journal={arXiv preprint arXiv:2506.13771},
  year={2025}
}