MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to its great potential for energy-efficient, high-performance computing paradigms. However, a substantial performance gap remains between SNN-based and ANN-based transformer architectures. While existing spiking self-attention mechanisms combine successfully with SNNs, the overall architectures built on them suffer from a bottleneck in effectively extracting features at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach on several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The code is available at this https URL.
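The abstract does not spell out MSSA's internals, so the following is a minimal sketch of what a multi-scale spiking attention block could look like in PyTorch. The branch kernel sizes, the stateless LIF neuron with a straight-through surrogate gradient, and the softmax-free attention are all illustrative assumptions, not the authors' actual design.

```python
# Hypothetical multi-scale spiking attention (MSSA) block: parallel depthwise
# convolutions at several kernel sizes extract features at different spatial
# scales; the fused result drives a spike-based self-attention branch.
import torch
import torch.nn as nn


class LIFNeuron(nn.Module):
    """Single-step leaky integrate-and-fire neuron (simplified, stateless).

    Uses a straight-through estimator so the binary spike function stays
    differentiable during training, a common trick in SNN implementations.
    """
    def __init__(self, tau: float = 2.0, v_threshold: float = 1.0):
        super().__init__()
        self.tau, self.v_threshold = tau, v_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = x / self.tau                       # leaky membrane integration
        spike = (v >= self.v_threshold).float()
        return spike + (v - v.detach())        # straight-through surrogate gradient


class MSSABlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, scales=(3, 5, 7)):
        super().__init__()
        # One depthwise-conv branch per spatial scale (kernel sizes assumed).
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim, bias=False),
                nn.BatchNorm2d(dim),
                LIFNeuron(),
            )
            for k in scales
        )
        self.fuse = nn.Conv2d(dim * len(scales), dim, 1, bias=False)
        self.qkv = nn.Conv2d(dim, dim * 3, 1, bias=False)
        self.proj = nn.Conv2d(dim, dim, 1, bias=False)
        self.spike = LIFNeuron()
        self.num_heads = num_heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        d = C // self.num_heads
        # Fuse multi-scale spike features back to the embedding dimension.
        ms = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        q, k, v = self.qkv(self.spike(ms)).chunk(3, dim=1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (B, C, H, W) -> (B, heads, HW, d)
            return t.flatten(2).view(B, self.num_heads, d, H * W).transpose(-2, -1)

        q, k, v = map(split_heads, (q, k, v))
        # Softmax-free scaled dot-product, as used in prior spiking
        # self-attention work (an assumption for this sketch).
        attn = (q @ k.transpose(-2, -1)) * d ** -0.5
        out = (attn @ v).transpose(-2, -1).reshape(B, C, H, W)
        return x + self.proj(self.spike(out))


# Usage: shape-preserving block on a feature map.
x = torch.randn(2, 64, 14, 14)
y = MSSABlock(dim=64)(x)   # -> torch.Size([2, 64, 14, 14])
```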
@article{hua2025_2505.14719,
  title   = {MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion},
  author  = {Wei Hua and Chenlin Zhou and Jibin Wu and Yansong Chua and Yangyang Shu},
  journal = {arXiv preprint arXiv:2505.14719},
  year    = {2025}
}