
Efficient Large Language Model Inference with Neural Block Linearization

Main: 9 pages
Figures: 7
Bibliography: 5 pages
Tables: 22
Appendix: 20 pages
Abstract

The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges to their deployment. To address this, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error (LMMSE) estimators. NBL leverages Canonical Correlation Analysis (CCA) to compute a theoretical upper bound on the approximation error, which then serves as the substitution criterion: the LLM layers with the lowest linearization error are selected for replacement. NBL can be applied efficiently to pre-trained LLMs without fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers of DeepSeek-R1-Distill-Llama-8B increases inference speed by 32% with an accuracy drop of less than 1%, making it a flexible and promising approach to improving the inference efficiency of LLMs.
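
As a rough illustration of the mechanism described in the abstract, the sketch below fits an LMMSE linear map to a block's calibration activations and ranks blocks by a CCA-based proxy for the linearization error. It assumes the input/output activations (X, Y) of each self-attention block have already been collected on a calibration set; the function names, the covariance regularization, and the scoring criterion sum(1 - rho_i^2) are illustrative assumptions, not the paper's reference implementation, which derives a formal upper bound on the error.

import numpy as np

def fit_lmmse_linear_map(X, Y, eps=1e-6):
    """Fit the LMMSE estimator Y_hat = (X - mu_x) @ W.T + mu_y for one block.

    X: (n_samples, d_in) block inputs; Y: (n_samples, d_out) block outputs.
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    C_xx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])  # regularized input covariance
    C_yx = Yc.T @ Xc / n                             # cross-covariance
    W = C_yx @ np.linalg.inv(C_xx)                   # LMMSE linear map
    return W, mu_x, mu_y

def cca_error_score(X, Y, eps=1e-6):
    """Proxy linearization error from canonical correlations between X and Y.

    High canonical correlations indicate a nearly linear block; we return
    sum(1 - rho_i^2), so lower scores mean better candidates for replacement.
    """
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    C_xx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])
    C_yy = Yc.T @ Yc / n + eps * np.eye(Y.shape[1])
    C_xy = Xc.T @ Yc / n
    # Whiten both sides; the singular values of the whitened cross-covariance
    # are the canonical correlations.
    Lx = np.linalg.cholesky(C_xx)
    Ly = np.linalg.cholesky(C_yy)
    M = np.linalg.solve(Lx, C_xy)
    M = np.linalg.solve(Ly, M.T).T
    rho = np.clip(np.linalg.svd(M, compute_uv=False), 0.0, 1.0)
    return float(np.sum(1.0 - rho ** 2))

def rank_blocks_for_linearization(calib_activations):
    """Rank blocks from most to least linearizable.

    calib_activations: dict mapping layer index -> (X, Y) calibration arrays.
    """
    scores = {idx: cca_error_score(X, Y) for idx, (X, Y) in calib_activations.items()}
    return sorted(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy calibration data for three hypothetical blocks (layer_idx -> (X, Y)).
    calib = {i: (rng.standard_normal((512, 64)),
                 rng.standard_normal((512, 64))) for i in range(3)}
    print(rank_blocks_for_linearization(calib))

In a real setting, one would then replace the top-ranked self-attention blocks with their fitted (W, mu_x, mu_y) maps and measure the resulting speed/accuracy trade-off, in the spirit of the 12-layer DeepSeek-R1-Distill-Llama-8B example above.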

@article{erdogan2025_2505.21077,
  title={Efficient Large Language Model Inference with Neural Block Linearization},
  author={Mete Erdogan and Francesco Tonin and Volkan Cevher},
  journal={arXiv preprint arXiv:2505.21077},
  year={2025}
}