Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer

20 March 2025

Richie Li

Sicheng Chen

Main: 6 pages, 5 figures, 2 tables. Bibliography: 1 page.

Abstract

Transformer-based large language models (LLMs) rely heavily on matrix multiplications in their attention and feed-forward layers, and the Q, K, and V linear projections in the Multi-Head Self-Attention (MHA) module are a major performance bottleneck. In this work, we present an optimized tiled matrix multiplication accelerator for the resource-constrained Xilinx KV260 FPGA that targets this bottleneck. Our design combines persistent on-chip storage, a two-level tiling strategy for data reuse, and a systolic-like unrolled compute engine, which together improve speed and energy efficiency. Integrated with DistilBERT for the Q, K, and V projections, the accelerator achieves a 7x speedup over ARM CPU implementations (PyTorch) and a 200x speedup over naive NumPy, reaching a throughput of up to 3.1 GFLOPS for (64,768) x (768,3072) matrix multiplications at a clock frequency of 100 MHz. These results demonstrate the potential of FPGA-based acceleration for critical Transformer operations and point toward scalable, energy-efficient deep learning inference on edge devices.
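To make the tiling idea in the abstract concrete, the following is a minimal software sketch of blocked matrix multiplication in NumPy. It is not the authors' hardware design: it shows only a single level of tiling, and the function names (`tiled_matmul`, `matmul_flops`) and the tile sizes are assumptions chosen for illustration, since the abstract does not state the accelerator's tile dimensions.

```python
import numpy as np


def tiled_matmul(a, b, tile_m=32, tile_n=64, tile_k=64):
    """Blocked matrix multiply C = A @ B (illustrative sketch only).

    Tile sizes are hypothetical; the abstract does not give the
    accelerator's actual tile dimensions.
    """
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")

    c = np.zeros((m, n), dtype=np.float32)
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            rows = min(tile_m, m - i0)
            cols = min(tile_n, n - j0)
            # The output tile is accumulated locally, mimicking a buffer
            # that stays on chip until the whole K dimension is consumed.
            acc = np.zeros((rows, cols), dtype=np.float32)
            for k0 in range(0, k, tile_k):
                a_tile = a[i0:i0 + tile_m, k0:k0 + tile_k]
                b_tile = b[k0:k0 + tile_k, j0:j0 + tile_n]
                acc += a_tile @ b_tile
            c[i0:i0 + rows, j0:j0 + cols] = acc
    return c


def matmul_flops(m, k, n):
    """Floating-point operations for an (m,k) x (k,n) multiply,
    counting one multiply and one add per inner-product term."""
    return 2 * m * k * n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((64, 768)).astype(np.float32)
    b = rng.standard_normal((768, 3072)).astype(np.float32)
    c = tiled_matmul(a, b)
    print("max abs error vs. a @ b:", float(np.max(np.abs(c - a @ b))))
    print("FLOPs per multiply:", matmul_flops(64, 768, 3072))
```

For the (64,768) x (768,3072) case, `matmul_flops` gives 301,989,888 operations, or about 0.3 GFLOP per multiply. At the reported peak of 3.1 GFLOPS this corresponds to roughly 0.1 s per multiplication; that figure is a back-of-envelope derivation from the abstract's numbers, not a measurement reported by the authors.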

@article{li2025_2503.16731,
  title={Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer},
  author={Richie Li and Sicheng Chen},
  journal={arXiv preprint arXiv:2503.16731},
  year={2025}
}
