Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving

17 April 2025
Yaoyao Ding
Bohan Hou
Xiao Zhang
Allan Lin
Tianqi Chen
Cody Hao Yu
Yida Wang
Gennady Pekhimenko
Abstract

Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computations. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels on their supported types. Compared to existing compilers like Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, our VM achieves performance improvements of 1.75x, 2.61x, 1.29x and 1.03x, respectively.
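
The kernel-level difficulty behind "arbitrary bit widths" is easiest to see with a width that does not divide the machine word. The sketch below is a minimal NumPy illustration, not code from Tilus; pack_weights and unpack_weights are hypothetical helpers. It packs and unpacks weights of width b into 32-bit words: for a width like 3, values straddle word boundaries, which is exactly the case that power-of-two-only kernels sidestep.

import numpy as np

def pack_weights(weights: np.ndarray, bits: int) -> np.ndarray:
    """Pack unsigned integers (each < 2**bits) densely into 32-bit words."""
    assert weights.min() >= 0 and weights.max() < (1 << bits)
    total_bits = weights.size * bits
    packed = np.zeros((total_bits + 31) // 32, dtype=np.uint32)
    for i, w in enumerate(weights.astype(np.uint64)):
        word, offset = divmod(i * bits, 32)
        packed[word] |= np.uint32((w << offset) & 0xFFFFFFFF)
        if offset + bits > 32:  # value straddles a 32-bit word boundary
            packed[word + 1] |= np.uint32(w >> (32 - offset))
    return packed

def unpack_weights(packed: np.ndarray, bits: int, count: int) -> np.ndarray:
    """Recover `count` values of width `bits` from the packed words."""
    mask = (1 << bits) - 1
    out = np.empty(count, dtype=np.uint32)
    for i in range(count):
        word, offset = divmod(i * bits, 32)
        value = int(packed[word]) >> offset
        if offset + bits > 32:
            value |= int(packed[word + 1]) << (32 - offset)
        out[i] = value & mask
    return out

# 3-bit example: 3 does not divide 32, so some values cross word
# boundaries -- a case that power-of-two bit widths never hit.
w = np.array([5, 7, 0, 3, 6, 1, 2, 4, 7, 5, 3], dtype=np.uint32)
assert np.array_equal(unpack_weights(pack_weights(w, 3), 3, w.size), w)

On a GPU, the unpacking side of this logic must run inside the kernel itself, which is why the abstract stresses fine-grained register management and optimized memory access patterns as prerequisites for efficient low-precision kernels.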

View on arXiv
@article{ding2025_2504.12984,
  title={Tilus: A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving},
  author={Yaoyao Ding and Bohan Hou and Xiao Zhang and Allan Lin and Tianqi Chen and Cody Hao Yu and Yida Wang and Gennady Pekhimenko},
  journal={arXiv preprint arXiv:2504.12984},
  year={2025}
}