QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives

Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementations take at least months to develop and lack portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce QiMeng-TensorOp, a tensor-operator auto-generation framework driven by a one-line user prompt, which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives and to tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves substantial performance improvements. Even compared with human experts, QiMeng-TensorOp reaches performance competitive with OpenBLAS on RISC-V CPUs and with cuBLAS on NVIDIA GPUs, while significantly reducing development costs.
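To make the workflow the abstract describes concrete (LLM-driven kernel generation followed by parameter tuning), here is a minimal, hypothetical Python sketch. The function names, prompt wording, hardware hints, and random-search tuner are illustrative assumptions, not QiMeng-TensorOp's actual interface.

```python
# Hypothetical sketch of a generate-then-tune pipeline; all names and the
# tuning strategy are assumptions for illustration, not the paper's API.
import random


def query_llm(prompt: str) -> str:
    """Stand-in for an LLM call that returns operator source code."""
    # A real system would call a hosted or local LLM here.
    return f"// kernel generated for: {prompt}"


def generate_operator(op: str, target: str, hints: dict) -> str:
    """Ask the LLM for a kernel written with hardware primitives."""
    prompt = (
        f"Generate a high-performance {op} operator for {target} "
        f"using hardware primitives (vector/matrix intrinsics). "
        f"Hardware hints: {hints}"
    )
    return query_llm(prompt)


def measure_gflops(kernel_src: str, params: dict) -> float:
    """Stand-in for compiling and benchmarking the kernel."""
    return random.random()  # replace with a real build-and-time step


def tune(kernel_src: str, search_space: dict, trials: int = 50) -> dict:
    """Random search over tuning knobs such as tile sizes and unroll factors."""
    best_params, best_perf = None, float("-inf")
    for _ in range(trials):
        params = {k: random.choice(v) for k, v in search_space.items()}
        perf = measure_gflops(kernel_src, params)
        if perf > best_perf:
            best_params, best_perf = params, perf
    return best_params


if __name__ == "__main__":
    # A one-line user prompt drives the whole pipeline.
    kernel = generate_operator(
        op="GEMM",
        target="RISC-V CPU with RVV 1.0",
        hints={"vector_width_bits": 256, "l1_cache_kb": 32},
    )
    best = tune(kernel, {"tile_m": [4, 8, 16],
                         "tile_n": [4, 8, 16],
                         "unroll": [1, 2, 4]})
    print("best tuning parameters:", best)
```

In this sketch the LLM handles the hardware-aware code generation while an ordinary search loop handles parameter tuning; the paper's framework couples these two stages so that hardware characteristics inform both.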
@article{zhang2025_2505.06302,
  title   = {QiMeng-TensorOp: Automatically Generating High-Performance Tensor Operators with Hardware Primitives},
  author  = {Xuzhi Zhang and Shaohui Peng and Qirui Zhou and Yuanbo Wen and Qi Guo and Ruizhi Chen and Xinguo Zhu and Weiqiang Xiong and Haixin Chen and Congying Ma and Ke Gao and Chen Zhao and Yanjun Wu and Yunji Chen and Ling Li},
  journal = {arXiv preprint arXiv:2505.06302},
  year    = {2025}
}