arXiv:2405.18886
Compressing Large Language Models using Low Rank and Low Precision Decomposition

29 May 2024
R. Saha
Naomi Sagan
Varun Srivastava
Andrea J. Goldsmith
Mert Pilanci
Abstract

The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$, and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}} \lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that LlaMa-2 7B/70B and LlaMa-3 8B models compressed using $\rm CALDERA$ outperform existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
