Small Vision-Language Models: A Survey on Compact Architectures and Techniques

9 March 2025
Nitesh Patnaik
Navdeep Nayak
Himani Bansal Agrawal
Moinak Chinmoy Khamaru
Gourav Bal
Saishree Smaranika Panda
Rishi Raj
Vishal Meena
Kartheek Vadlamani
Abstract

The emergence of small vision-language models (sVLMs) marks a critical advancement in multimodal AI, enabling efficient processing of visual and textual data in resource-constrained environments. This survey offers a comprehensive exploration of sVLM development, presenting a taxonomy of architectures (transformer-based, Mamba-based, and hybrid) that highlights innovations in compact design and computational efficiency. Techniques such as knowledge distillation, lightweight attention mechanisms, and modality pre-fusion are discussed as enablers of high performance with reduced resource requirements. Through an in-depth analysis of models like TinyGPT-V, MiniGPT-4, and VL-Mamba, we identify trade-offs between accuracy, efficiency, and scalability. Persistent challenges, including data biases and generalization to complex tasks, are critically examined, with proposed pathways for addressing them. By consolidating advancements in sVLMs, this work underscores their transformative potential for accessible AI, setting a foundation for future research into efficient multimodal systems.
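To make the knowledge-distillation technique mentioned in the abstract concrete, below is a minimal, self-contained sketch of response-based distillation (a temperature-softened KL term blended with the standard hard-label loss). It is not the specific recipe of any surveyed model; the teacher/student heads, dimensions, and hyperparameters are illustrative placeholders.

```python
# Minimal sketch of response-based knowledge distillation for a compact model.
# The modules, sizes, and hyperparameters are illustrative assumptions,
# not the exact setups of the sVLMs analyzed in the survey.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term (teacher guidance) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: a frozen "teacher" head guides a smaller trainable "student" head
# over the same shared features.
teacher = nn.Linear(512, 100)
student = nn.Linear(512, 100)
features = torch.randn(8, 512)
labels = torch.randint(0, 100, (8,))

with torch.no_grad():
    t_logits = teacher(features)
s_logits = student(features)
loss = distillation_loss(s_logits, t_logits, labels)
loss.backward()
```

In practice the same loss is applied to a large VLM teacher and a compact student; the temperature T and mixing weight alpha trade off how much the student imitates the teacher's soft predictions versus fitting the ground-truth labels.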

@article{patnaik2025_2503.10665,
  title={Small Vision-Language Models: A Survey on Compact Architectures and Techniques},
  author={Nitesh Patnaik and Navdeep Nayak and Himani Bansal Agrawal and Moinak Chinmoy Khamaru and Gourav Bal and Saishree Smaranika Panda and Rishi Raj and Vishal Meena and Kartheek Vadlamani},
  journal={arXiv preprint arXiv:2503.10665},
  year={2025}
}