InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

14 April 2025
Jinguo Zhu
Weiyun Wang
Zhe Chen
Zhaoyang Liu
Shenglong Ye
Lixin Gu
Hao Tian
Yuchen Duan
Weijie Su
Jie Shao
Zhangwei Gao
Erfei Cui
Xuehui Wang
Yue Cao
Yangzhou Liu
Xingguang Wei
Hongjie Zhang
Haomin Wang
Weiye Xu
Hao Li
Jiahao Wang
Nianchen Deng
Songze Li
Yinan He
Tan Jiang
Jiapeng Luo
Yi Wang
Conghui He
Botian Shi
Xingcheng Zhang
Wenqi Shao
Junjun He
Yingtong Xiong
Wenwen Qu
Peng Sun
Penglong Jiao
Han Lv
Lijun Wu
Kaipeng Zhang
Huipeng Deng
Jiaye Ge
Kai Chen
Limin Wang
Min Dou
Lewei Lu
Xizhou Zhu
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM · VLM
Abstract

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
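
The abstract names variable visual position encoding (V2PE) without spelling out the mechanism. Below is a minimal Python sketch of the general idea, assuming the fractional-stride formulation in which text tokens advance the position index by 1 while visual tokens advance it by a smaller stride so long image-token sequences consume less of the positional range; the stride value, token-type labels, and function name are illustrative assumptions, not the InternVL3 implementation.

from typing import List

def v2pe_position_ids(token_types: List[str], visual_stride: float = 0.25) -> List[float]:
    # Assign position indices: text tokens advance by 1.0, visual tokens by a
    # smaller fractional stride, so extended multimodal contexts stay within
    # the model's usual positional window (illustrative sketch only).
    positions, current = [], 0.0
    for t in token_types:
        positions.append(current)
        current += 1.0 if t == "text" else visual_stride
    return positions

# Example: 4 text tokens, 8 visual tokens, then 3 text tokens.
tokens = ["text"] * 4 + ["visual"] * 8 + ["text"] * 3
print(v2pe_position_ids(tokens))
# Here the 8 visual tokens span only 8 * 0.25 = 2 position units instead of 8.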

@article{zhu2025_2504.10479,
  title={InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models},
  author={Jinguo Zhu and Weiyun Wang and Zhe Chen and Zhaoyang Liu and Shenglong Ye and Lixin Gu and Hao Tian and Yuchen Duan and Weijie Su and Jie Shao and Zhangwei Gao and Erfei Cui and Xuehui Wang and Yue Cao and Yangzhou Liu and Xingguang Wei and Hongjie Zhang and Haomin Wang and Weiye Xu and Hao Li and Jiahao Wang and Nianchen Deng and Songze Li and Yinan He and Tan Jiang and Jiapeng Luo and Yi Wang and Conghui He and Botian Shi and Xingcheng Zhang and Wenqi Shao and Junjun He and Yingtong Xiong and Wenwen Qu and Peng Sun and Penglong Jiao and Han Lv and Lijun Wu and Kaipeng Zhang and Huipeng Deng and Jiaye Ge and Kai Chen and Limin Wang and Min Dou and Lewei Lu and Xizhou Zhu and Tong Lu and Dahua Lin and Yu Qiao and Jifeng Dai and Wenhai Wang},
  journal={arXiv preprint arXiv:2504.10479},
  year={2025}
}