Jina-VLM: Small Multilingual Vision Language Model

3 December 2025

Andreas Koukounas

Georgios Mastrapas

Florian Hönicke

Sedigheh Eslami

Guillaume Roncari

Scott Martens

Han Xiao

MLLM

Main: 8 Pages

12 Figures

Bibliography: 3 Pages

15 Tables

Appendix: 7 Pages

Abstract

We present Jina-VLM, a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. It achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at this https URL.
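
The abstract does not describe how the attention-pooling connector works internally. As a rough illustration of the general idea only, the sketch below pools each non-overlapping window of patch vectors into a single vector using softmax attention weights, which reduces the number of visual tokens passed to the language model. Everything here is an assumption for illustration: the function name `attention_pool`, the 2x2 window, the use of the window mean as the query, and the absence of learned projections are hypothetical and are not taken from the paper or the released code.

```python
import math


def attention_pool(tokens, grid_h, grid_w, window=2):
    """Pool a row-major grid of patch vectors into one vector per window.

    Hypothetical sketch: the query is the plain mean of each window and
    there are no learned projections, unlike a trained connector.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("tokens must contain grid_h * grid_w vectors")
    if grid_h % window or grid_w % window:
        raise ValueError("grid dimensions must be divisible by window")

    dim = len(tokens[0])
    pooled = []
    for top in range(0, grid_h, window):
        for left in range(0, grid_w, window):
            # Collect the vectors that fall inside this window.
            patch = [
                tokens[(top + r) * grid_w + (left + c)]
                for r in range(window)
                for c in range(window)
            ]
            # Assumed query: the mean of the window's vectors.
            query = [sum(v[i] for v in patch) / len(patch) for i in range(dim)]
            # Scaled dot-product scores between the query and each vector.
            scores = [
                sum(q * x for q, x in zip(query, v)) / math.sqrt(dim)
                for v in patch
            ]
            # Numerically stable softmax over the window.
            peak = max(scores)
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            # Weighted sum of the window's vectors gives one pooled token.
            pooled.append(
                [sum(w * v[i] for w, v in zip(weights, patch)) for i in range(dim)]
            )
    return pooled


# A 4x4 grid of 16 patch vectors becomes 4 pooled vectors with a 2x2 window.
example = [[float(i), float(i) / 2.0] for i in range(16)]
print(len(attention_pool(example, grid_h=4, grid_w=4)))
```

With a 2x2 window the token count drops by a factor of four regardless of the grid size, which is the kind of saving that matters when higher-resolution images produce larger patch grids.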
