The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$, and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further enhances zero-shot performance. CALDERA obtains this decomposition by formulating it as the optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}} \lVert (\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top \rVert_{\mathrm{F}}^2$, where $\mathbf{X}$ is the calibration data and $\mathbf{Q}$, $\mathbf{L}$, $\mathbf{R}$ are constrained to be representable in low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results show that LlaMa-2 7B/70B and LlaMa-3 8B models compressed with CALDERA outperform existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. The implementation is available at: \href{https://github.com/pilancilab/caldera}{https://github.com/pilancilab/caldera}.
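The decomposition objective can be illustrated with the minimal NumPy sketch below. This is not the CALDERA implementation from the linked repository: the uniform quantizer, the plain SVD used in place of the paper's calibration-aware rank-constrained regression step, and all function names are illustrative assumptions.

```python
# Minimal sketch of approximating W by Q + L @ R.T with quantized entries,
# loosely targeting || (Q + L @ R.T - W) @ X.T ||_F on calibration data X.
# NOT the CALDERA algorithm itself: the quantizer and the alternating SVD
# scheme below are illustrative stand-ins.
import numpy as np

def uniform_quantize(M, bits):
    """Round entries of M to a uniform grid with 2**bits levels (illustrative)."""
    lo, hi = M.min(), M.max()
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    return np.round((M - lo) / scale) * scale + lo

def low_rank_low_precision(W, X, rank=64, q_bits=2, lr_bits=4, iters=5):
    """Return quantized Q, L, R with W ~= Q + L @ R.T, plus the calibration-weighted error."""
    Q = np.zeros_like(W)
    L = np.zeros((W.shape[0], rank))
    R = np.zeros((W.shape[1], rank))
    for _ in range(iters):
        # Fit low-rank factors to the current residual (plain SVD stands in for
        # the rank-constrained regression step), then quantize the factors.
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = uniform_quantize(U[:, :rank] * S[:rank], lr_bits)
        R = uniform_quantize(Vt[:rank].T, lr_bits)
        # Quantize the remaining full-rank residual as the backbone Q.
        Q = uniform_quantize(W - L @ R.T, q_bits)
    err = np.linalg.norm((Q + L @ R.T - W) @ X.T)
    return Q, L, R, err

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256))   # a layer weight matrix
    X = rng.standard_normal((128, 256))   # calibration activations (illustrative)
    Q, L, R, err = low_rank_low_precision(W, X, rank=32, q_bits=2, lr_bits=4)
    print("calibration-weighted approximation error:", err)
```

In the algorithm described by the abstract, the backbone $\mathbf{Q}$ and the factors $\mathbf{L}$, $\mathbf{R}$ are instead obtained by solving the calibration-weighted, rank-constrained problem directly, rather than by the unweighted SVD used in this sketch.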