
Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models

Abstract

Advancements in Natural Language Processing rely heavily on the Transformer architecture, whose improvements come at substantial resource costs due to ever-growing model sizes. This study explores optimization techniques, including Quantization, Knowledge Distillation (KD), and Pruning, focusing on energy and computational efficiency while retaining performance. Among standalone methods, 4-bit Quantization significantly reduces energy use with minimal accuracy loss. Hybrid approaches, such as NVIDIA's Minitron, which combines KD and Structured Pruning, further demonstrate promising trade-offs between size reduction and accuracy retention. A novel optimization equation is introduced, offering a flexible framework for comparing the various methods. Through the investigation of these compression methods, we provide valuable insights for developing more sustainable and efficient LLMs, shedding light on the often-ignored concern of energy efficiency.
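
The abstract highlights 4-bit Quantization as the most effective standalone method. As a rough illustration only (not the paper's actual scheme), the sketch below shows symmetric per-tensor 4-bit quantization of a weight matrix; the function names, the per-tensor scaling choice, and the use of NumPy are assumptions made for clarity.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: map floats to integers in [-8, 7].
    Assumption: a single scale for the whole tensor (per-tensor, not per-channel)."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the 4-bit integers."""
    return q.astype(np.float32) * scale

# Usage: quantize a random weight matrix and check the reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Storing each weight in 4 bits instead of 32 cuts memory traffic roughly 8x, which is where most of the energy savings discussed in the paper would come from; the small reconstruction error printed above is the accuracy cost being traded against that saving.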

@article{wallace2025_2502.00046,
  title={Optimization Strategies for Enhancing Resource Efficiency in Transformers & Large Language Models},
  author={Tom Wallace and Naser Ezzati-Jivan and Beatrice Ombuki-Berman},
  journal={arXiv preprint arXiv:2502.00046},
  year={2025}
}
