FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference

The size and compute characteristics of modern large language models have led to increased interest in developing specialized kernels tailored for training and inference. Existing kernels primarily optimize for compute utilization, targeting large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many applications such as edge deployment and latency-sensitive workloads. This paper describes FlashFormer, a proof-of-concept kernel for accelerating single-batch inference of transformer-based large language models. Across various model sizes and quantization settings, we observe nontrivial speedups compared to existing state-of-the-art inference kernels.
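To see why single-batch decoding is memory-bandwidth bound, a back-of-the-envelope roofline estimate helps: each generated token requires streaming every model weight from GPU memory at least once, so the weight-read time lower-bounds per-token latency. The sketch below illustrates this arithmetic; the model size and bandwidth figures are illustrative assumptions, not numbers from the paper.

```python
# Back-of-the-envelope memory-bandwidth ceiling for batch-size-1 decoding.
# Assumptions (not from the paper): an 8B-parameter model in fp16 on a GPU
# with ~3.35 TB/s of HBM bandwidth (e.g., H100 SXM).

params = 8e9             # assumed parameter count
bytes_per_param = 2      # fp16/bf16 weights
hbm_bandwidth = 3.35e12  # assumed memory bandwidth, bytes/s

weight_bytes = params * bytes_per_param
latency_floor_s = weight_bytes / hbm_bandwidth  # time to read weights once
tokens_per_s_ceiling = 1.0 / latency_floor_s

print(f"Weights: {weight_bytes / 1e9:.1f} GB")
print(f"Per-token latency floor: {latency_floor_s * 1e3:.2f} ms")
print(f"Throughput ceiling: {tokens_per_s_ceiling:.0f} tokens/s")
# On top of this floor, kernel launch overheads (microseconds per launch,
# across the hundreds of launches a layer-by-layer forward pass issues per
# token) further erode the achievable rate, motivating fusion of the whole
# forward pass into far fewer kernel launches.
```

Under these assumed numbers, the bandwidth ceiling is roughly 200 tokens/s, so even microsecond-scale per-launch overheads become a meaningful fraction of the per-token budget.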
@article{nrusimha2025_2505.22758,
  title={FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference},
  author={Aniruddha Nrusimha and William Brandon and Mayank Mishra and Yikang Shen and Rameswar Panda and Jonathan Ragan-Kelley and Yoon Kim},
  journal={arXiv preprint arXiv:2505.22758},
  year={2025}
}