MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

Abstract

The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to its great potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap remains between SNN-based and ANN-based transformer architectures. While existing spiking self-attention mechanisms have been successfully integrated into SNNs, the overall architectures built on them suffer from a bottleneck in effectively extracting features at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach on several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The code is available at this https URL.
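To make the idea concrete, the sketch below illustrates one plausible form of a multi-scale spiking attention block: parallel depthwise convolution branches with different kernel sizes capture features at multiple receptive-field scales, and a Heaviside threshold binarizes activations into spikes. This is a minimal illustration assuming a generic design; the class name `MultiScaleSpikingAttention`, the branch structure, and the simple threshold neuron are assumptions, not the paper's exact MSSA formulation.

```python
import torch
import torch.nn as nn


class MultiScaleSpikingAttention(nn.Module):
    """Illustrative multi-scale spiking attention block.

    NOTE: an assumption-based sketch of the general technique,
    not the MSVIT paper's exact MSSA design.
    """

    def __init__(self, channels, kernel_sizes=(3, 5, 7), threshold=1.0):
        super().__init__()
        # One depthwise conv branch per scale, so each branch sees a
        # different receptive-field size over the same feature map.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # 1x1 conv fuses the concatenated multi-scale spike maps.
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
        self.threshold = threshold

    def spike(self, x):
        # Heaviside step: potentials at or above threshold emit a spike (1.0).
        return (x >= self.threshold).float()

    def forward(self, x):
        # Extract features at each scale, binarize into spikes, then fuse.
        feats = [self.spike(branch(x)) for branch in self.branches]
        return self.spike(self.fuse(torch.cat(feats, dim=1)))
```

In this form the block's output is strictly binary, which is what allows a spike-driven architecture to replace multiplications with additions at inference time.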

@article{hua2025_2505.14719,
  title={MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion},
  author={Wei Hua and Chenlin Zhou and Jibin Wu and Yansong Chua and Yangyang Shu},
  journal={arXiv preprint arXiv:2505.14719},
  year={2025}
}