This work introduces novel training and post-training compression schemes to reduce external memory access during transformer model inference. Additionally, a new control-flow mechanism, called dynamic batching, and a novel buffer architecture, termed a two-direction accessible register file, further reduce external memory access while improving hardware utilization.
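The abstract does not describe how the dynamic batching mechanism works internally; in the paper it is a hardware control-flow mechanism, not software. As a rough illustration of the general idea only, the sketch below shows dynamic batching in the generic sense: packing variable-length token workloads into batches at runtime under a fixed budget, so compute units stay busy without padding every sequence to the longest one. All names here (Request, dynamic_batch, max_tokens) are hypothetical and are not from the paper.

from dataclasses import dataclass

@dataclass
class Request:
    """One inference request with a variable-length token sequence."""
    req_id: int
    tokens: list[int]

def dynamic_batch(queue: list[Request], max_tokens: int) -> list[list[Request]]:
    """Greedily pack queued requests into batches whose total token
    count stays within a budget, formed on the fly rather than with a
    fixed batch size. Illustrative only; the paper's mechanism is a
    hardware control-flow scheme whose details are not in the abstract."""
    batches: list[list[Request]] = []
    current: list[Request] = []
    used = 0
    for req in queue:
        cost = len(req.tokens)
        # Close the current batch when adding this request would
        # exceed the token budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(req)
        used += cost
    if current:
        batches.append(current)
    return batches

if __name__ == "__main__":
    queue = [Request(i, list(range(n))) for i, n in enumerate([7, 3, 12, 5, 9])]
    for batch in dynamic_batch(queue, max_tokens=16):
        print([(r.req_id, len(r.tokens)) for r in batch])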
@article{moon2025_2503.00322,
  title={T-REX: A 68-567 μs/token, 0.41-3.95 μJ/token Transformer Accelerator with Reduced External Memory Access and Enhanced Hardware Utilization in 16nm FinFET},
  author={Seunghyun Moon and Mao Li and Gregory Chen and Phil Knag and Ram Krishnamurthy and Mingoo Seok},
  journal={arXiv preprint arXiv:2503.00322},
  year={2025}
}