SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding

23 May 2025
Xuerui Qiu
Peixi Wu
Yaozhi Wen
Shaowei Gu
Yuqi Pan
Xinhao Luo
Bo Xu
Guoqi Li
Abstract

Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing SNNs still exhibit a significant performance gap compared to Artificial Neural Networks (ANNs) due to inadequate pre-training strategies. These limitations manifest as restricted generalization ability, task specificity, and a lack of multimodal understanding, particularly in challenging tasks such as multimodal question answering and zero-shot 3D classification. To overcome these challenges, we propose a Spike-based Vision-Language (SVL) pretraining framework that empowers SNNs with open-world 3D understanding while maintaining spike-driven efficiency. SVL introduces two key components: (i) Multi-scale Triple Alignment (MTA) for label-free triplet-based contrastive learning across 3D, image, and text modalities, and (ii) Re-parameterizable Vision-Language Integration (Rep-VLI) to enable lightweight inference without relying on large text encoders. Extensive experiments show that SVL achieves a top-1 accuracy of 85.4% in zero-shot 3D classification, surpassing advanced ANN models, and consistently outperforms prior SNNs on downstream tasks, including 3D classification (+6.1%), DVS action recognition (+2.1%), 3D detection (+1.1%), and 3D segmentation (+2.1%) with remarkable efficiency. Moreover, SVL enables SNNs to perform open-world 3D question answering, sometimes outperforming ANNs. To the best of our knowledge, SVL represents the first scalable, generalizable, and hardware-friendly paradigm for 3D open-world understanding, effectively bridging the gap between SNNs and ANNs in complex open-world understanding tasks. Code is available at this https URL.
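The abstract names two mechanisms without giving formulas. The sketch below is a minimal, hypothetical PyTorch illustration of the two ideas as commonly realized in the literature: a CLIP-style symmetric InfoNCE loss applied pairwise over the 3D/image/text triplet (in the spirit of MTA), and folding precomputed text embeddings into a plain linear head so the text encoder is not needed at inference (in the spirit of Rep-VLI). The function names, signatures, and loss formulation are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def triplet_contrastive_loss(z_3d, z_img, z_txt, temperature=0.07):
    """Hypothetical MTA-style loss: pairwise InfoNCE over a 3D/image/text triplet.

    z_3d, z_img, z_txt: (B, D) embeddings from the three encoders.
    """
    def info_nce(a, b):
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature  # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # symmetric cross-entropy: match the i-th a to the i-th b and vice versa
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # align the 3D branch with both the image and the text branch
    return info_nce(z_3d, z_img) + info_nce(z_3d, z_txt)

@torch.no_grad()
def fold_text_encoder(text_embeddings):
    """Hypothetical Rep-VLI-style folding: bake frozen class-text embeddings
    into a bias-free linear head, removing the text encoder from inference.

    text_embeddings: (num_classes, D) from the frozen text encoder.
    """
    w = F.normalize(text_embeddings, dim=-1)
    head = torch.nn.Linear(w.size(1), w.size(0), bias=False)
    head.weight.copy_(w)  # logits = z_3d @ w.t()
    return head

Under these assumptions, zero-shot classification at inference reduces to head(z_3d).argmax(-1), which is consistent with the abstract's claim of lightweight inference without large text encoders.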

@article{qiu2025_2505.17674,
  title={SVL: Spike-based Vision-language Pretraining for Efficient 3D Open-world Understanding},
  author={Xuerui Qiu and Peixi Wu and Yaozhi Wen and Shaowei Gu and Yuqi Pan and Xinhao Luo and Bo Xu and Guoqi Li},
  journal={arXiv preprint arXiv:2505.17674},
  year={2025}
}