Vision-LSTM: xLSTM as Generic Vision Backbone

24 February 2025
Benedikt Alkin
Maximilian Beck
Korbinian Pöppel
Sepp Hochreiter
Johannes Brandstetter
Abstract

Transformers are widely used as generic backbones in computer vision, despite having been introduced originally for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended into a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks in which odd-numbered blocks process the sequence of patch tokens from top to bottom while even-numbered blocks go from bottom to top. Experiments show that ViL holds promise for further deployment as a new generic backbone for computer vision architectures.
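
A minimal sketch of the alternating-direction block stack described in the abstract, assuming a placeholder SequenceMixer module in place of the real mLSTM-based xLSTM block (not reproduced here). Only the alternating forward/backward traversal of the patch-token sequence follows the report's description; all class names, dimensions, and the mixer's internals are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SequenceMixer(nn.Module):
    """Placeholder for an xLSTM block. A real ViL block would use
    exponential gating and a matrix memory (see the paper)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        return x + self.proj(self.norm(x))

class ViLSketch(nn.Module):
    def __init__(self, dim=192, depth=12):
        super().__init__()
        self.blocks = nn.ModuleList(SequenceMixer(dim) for _ in range(depth))

    def forward(self, tokens):  # tokens: (batch, num_patches, dim)
        for i, block in enumerate(self.blocks):
            if i % 2 == 0:
                # odd-numbered blocks (1st, 3rd, ...): patch sequence
                # processed top to bottom
                tokens = block(tokens)
            else:
                # even-numbered blocks: reverse the patch sequence,
                # process it, then restore the original order
                tokens = torch.flip(block(torch.flip(tokens, dims=[1])), dims=[1])
        return tokens

tokens = torch.randn(2, 196, 192)  # e.g. 14x14 patches of a 224x224 image
out = ViLSketch()(tokens)
print(out.shape)  # torch.Size([2, 196, 192])

Flipping the sequence between blocks lets a causal, recurrence-style mixer see the tokens in both directions across depth without doubling the per-block cost.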

@article{alkin2025_2406.04303,
  title={Vision-LSTM: xLSTM as Generic Vision Backbone},
  author={Benedikt Alkin and Maximilian Beck and Korbinian Pöppel and Sepp Hochreiter and Johannes Brandstetter},
  journal={arXiv preprint arXiv:2406.04303},
  year={2025}
}