Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

23 May 2025

Main: 9 pages, Bibliography: 4 pages, Appendix: 3 pages; 5 figures, 11 tables

Abstract

The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture that combines a dual-branch design with a state-space model (SSM) backbone: one branch captures spatial features, the other focuses on temporal dynamics, and the two are fused continuously through a gating mechanism. We also present a new benchmark for video violence detection, built by merging the RWF-2000, RLVS, and VioPeru datasets while ensuring strict separation between training and testing sets. Our model achieves state-of-the-art performance on this benchmark, offering an optimal balance between accuracy and computational efficiency and demonstrating the promise of SSMs for scalable, real-time surveillance violence detection.
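
The abstract does not give the exact form of the gate, so the following is only a minimal sketch of one common way to fuse two class tokens with a learned gate. It assumes an element-wise sigmoid gate computed from the concatenated spatial and temporal class tokens. The function name `gated_cls_fusion` and the parameters `gate_weights` and `gate_bias` are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_cls_fusion(
    cls_spatial: List[float],
    cls_temporal: List[float],
    gate_weights: List[List[float]],
    gate_bias: List[float],
) -> List[float]:
    """Fuse two D-dimensional class tokens with an element-wise gate.

    Assumed form (not confirmed by the source):
        g     = sigmoid(W @ [cls_spatial; cls_temporal] + b)
        fused = g * cls_spatial + (1 - g) * cls_temporal
    gate_weights has D rows of length 2*D; gate_bias has length D.
    """
    concat = cls_spatial + cls_temporal  # list concatenation, length 2*D
    fused = []
    for d, (s, t) in enumerate(zip(cls_spatial, cls_temporal)):
        logit = sum(w * x for w, x in zip(gate_weights[d], concat)) + gate_bias[d]
        g = sigmoid(logit)
        fused.append(g * s + (1.0 - g) * t)
    return fused


# With zero weights and zero bias the gate is 0.5 everywhere,
# so the fused token is the plain average of the two branches.
zero_w = [[0.0] * 4 for _ in range(2)]
averaged = gated_cls_fusion([1.0, 2.0], [3.0, 4.0], zero_w, [0.0, 0.0])
```

In this sketch a large positive bias pushes the gate toward the spatial token and a large negative bias toward the temporal token. In a trained model the gate parameters would be learned, and, if fusion is applied repeatedly through the network as "continuous fusion" suggests, each fusion point would presumably have its own parameters.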


@article{senadeera2025_2506.03162,
  title={Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection},
  author={Damith Chamalke Senadeera and Xiaoyun Yang and Dimitrios Kollias and Gregory Slabaugh},
  journal={arXiv preprint arXiv:2506.03162},
  year={2025}
}
