LittleBit: Ultra Low-Bit Quantization via Latent Factorization

Deploying large language models (LLMs) is often hampered by substantial memory and computational costs. Quantization offers a solution, yet avoiding performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels such as 0.1 bits per weight (BPW), achieving a nearly 31× memory reduction, e.g., compressing Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization and subsequently binarizes these factors. To counteract the information loss at this extreme precision, it integrates a multi-scale compensation mechanism with learned scales along the rows, the columns, and an additional latent dimension that captures per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for stable quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: for example, its 0.1 BPW performance on Llama2-7B surpasses the leading method's performance at 0.7 BPW. This establishes a superior size-performance trade-off, with kernel-level benchmarks indicating potential for a 5× speedup over FP16. LittleBit paves the way for deploying powerful LLMs in resource-constrained environments.
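To make the abstract's description concrete, the sketch below shows one plausible reading of the weight representation: a low-rank factorization whose factors are stored as signs, compensated by learned scales along the rows, columns, and latent (rank) dimension, plus the back-of-the-envelope storage cost per weight. This is a minimal illustration under stated assumptions, not the paper's implementation: the names (reconstruct_weight, bits_per_weight, U, V, s_row, s_col, s_rank), the rank value, the FP16 scale overhead, and the exact composition order are assumptions inferred from the text above.

import numpy as np

def reconstruct_weight(U, V, s_row, s_col, s_rank):
    """Rebuild a (d_out x d_in) weight from binarized low-rank factors.

    U: (d_out, r) latent factor; only its sign is stored (1 bit per entry)
    V: (d_in, r)  latent factor; only its sign is stored (1 bit per entry)
    s_row:  (d_out,) per-row compensation scale
    s_col:  (d_in,)  per-column compensation scale
    s_rank: (r,)     per-rank importance scale on the latent dimension
    """
    B_u = np.where(U >= 0, 1.0, -1.0)   # binarized latent factor
    B_v = np.where(V >= 0, 1.0, -1.0)
    low_rank = (B_u * s_rank) @ B_v.T   # (d_out, d_in), rank <= r
    return s_row[:, None] * low_rank * s_col[None, :]

def bits_per_weight(d_out, d_in, r, scale_bits=16):
    """Storage cost of the representation above, in bits per original weight:
    two sign matrices plus three scale vectors kept in higher precision."""
    binary_bits = r * (d_out + d_in)
    scale_bits_total = scale_bits * (d_out + d_in + r)
    return (binary_bits + scale_bits_total) / (d_out * d_in)

# Illustrative arithmetic: a 4096 x 4096 projection with rank r = 188
# lands at roughly 0.1 BPW under these assumptions.
print(bits_per_weight(4096, 4096, r=188))  # ~0.10

# Tiny usage check of the reconstruction shape.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8
W_hat = reconstruct_weight(
    rng.standard_normal((d_out, r)), rng.standard_normal((d_in, r)),
    s_row=np.ones(d_out), s_col=np.ones(d_in), s_rank=np.ones(r))
print(W_hat.shape)  # (64, 64)

Under this accounting, the binary factors dominate the footprint and the scale vectors add only a small fixed overhead, which is what makes sub-1-bit averages such as 0.1 BPW reachable.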
@article{lee2025_2506.13771,
  title={LittleBit: Ultra Low-Bit Quantization via Latent Factorization},
  author={Banseok Lee and Dongkyu Kim and Youngcheon You and Youngmin Kim},
  journal={arXiv preprint arXiv:2506.13771},
  year={2025}
}