2504.10479
v3 (latest)
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
14 April 2025
Jinguo Zhu
Weiyun Wang
Zhe Chen
Ziwei Liu
Shenglong Ye
Lixin Gu
Yuchen Duan
H. Tian
Weijie Su
Jie Shao
Zhangwei Gao
Erfei Cui
Yue Cao
Yangzhou Liu
Xingguang Wei
Hongjie Zhang
Haomin Wang
Wenyuan Xu
Hao Li
Jiahao Wang
Dengnian Chen
Songze Li
Yinan He
Tan Jiang
Jiapeng Luo
Yi Wang
Conghui He
Botian Shi
Xinsong Zhang
Wenqi Shao
Junjun He
Yingtong Xiong
Wenwen Qu
Peng Sun
Penglong Jiao
Han Lv
Lijun Wu
Kai Zhang
Huipeng Deng
Jiaye Ge
Kai Chen
Limin Wang
Min Dou
Lewei Lu
X. Zhu
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
Wei Wang
MLLM
VLM
Papers citing
"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models"
50 / 161 papers shown
Title
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Sihan Yang
Runsen Xu
Yiman Xie
Sizhe Yang
Mo Li
...
Haodong Duan
Xiangyu Yue
Dahua Lin
Tai Wang
Jiangmiao Pang
VLM
LRM
49
1
0
29 May 2025
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
Y. Liu
Kun Ouyang
Haoning Wu
Yi Liu
Lin Sui
Xinhao Li
Y. Zhong
Y. Charles
Xinyu Zhou
Xu Sun
VLM
LRM
64
0
0
29 May 2025
DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
Sungjune Park
Hyunjun Kim
Junho Kim
S. T. Kim
Y. Ro
LRM
106
0
0
29 May 2025
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
Tingyu Song
Tongyan Hu
Guo Gan
Yilun Zhao
73
0
0
29 May 2025
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Yi Ding
Ruqi Zhang
ReLM
LRM
VLM
91
0
0
28 May 2025
DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response
Junjue Wang
Weihao Xuan
Heli Qi
Zhihao Liu
Kunyi Liu
...
Hongruixuan Chen
Jian Song
J. Xia
Zhuo Zheng
Naoto Yokoya
43
0
0
27 May 2025
DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving
Muxi Diao
Lele Yang
Hongbo Yin
Zhexu Wang
Yejie Wang
Daxin Tian
Kongming Liang
Zhanyu Ma
VLM
LRM
55
0
0
27 May 2025
Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models
Weihao Xuan
Qingcheng Zeng
Heli Qi
Junjue Wang
Naoto Yokoya
46
0
0
26 May 2025
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought
Chao Huang
Benfeng Wang
Jie Wen
Chengliang Liu
Wei Wang
Li Shen
Xiaochun Cao
LRM
59
0
0
26 May 2025
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Yunlong Tang
Pinxin Liu
Mingqian Feng
Zhangyun Tan
Rui Mao
...
Hang Hua
Ali Vosoughi
Luchuan Song
Zeliang Zhang
Chenliang Xu
LRM
41
0
0
26 May 2025
InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts
Minzhi Lin
Tianchi Xie
Mengchen Liu
Yilin Ye
C. L. Philip Chen
Shixia Liu
60
0
0
25 May 2025
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
Sicheng Feng
Song Wang
Shuyi Ouyang
Lingdong Kong
Zikai Song
Jianke Zhu
Huan Wang
Xinchao Wang
LRM
89
0
0
24 May 2025
MLLMs are Deeply Affected by Modality Bias
Xu Zheng
Chenfei Liao
Yuqian Fu
Kaiyu Lei
Yuanhuiyi Lyu
...
Yu Jiang
N. Sebe
Dacheng Tao
Luc Van Gool
Xuming Hu
37
0
0
24 May 2025
So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection
Zhenglin Huang
Tianxiao Li
Xiangtai Li
Haiquan Wen
Yiwei He
...
Hao Fei
Xi Yang
Xiaowei Huang
Bei Peng
Guangliang Cheng
67
0
0
24 May 2025
ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models
Duo Li
Zuhao Yang
Shijian Lu
VLM
71
0
0
24 May 2025
EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications
Ancheng Xu
Zhihao Yang
Junlin Li
Guanghu Yuan
Longze Chen
...
Zhen Qin
Hengyun Chang
Hamid Alinejad-Rokny
Bo Zheng
Min Yang
AAML
22
0
0
23 May 2025
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
Jingjing Jiang
Chongjie Si
Jun Luo
Hanwang Zhang
Chao Ma
168
0
0
23 May 2025
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow
Haoyu Sun
Huichen Will Wang
Jiawei Gu
Linjie Li
Yu Cheng
VLM
70
0
0
23 May 2025
Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning
Siqu Ou
Hongcheng Liu
Pingjie Wang
Yusheng Liao
Chuan Xuan
Yanfeng Wang
Yu Wang
LRM
18
0
0
22 May 2025
Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering
Kuicai Dong
Yujing Chang
Shijie Huang
Yasheng Wang
Ruiming Tang
Yong Liu
64
1
0
22 May 2025
RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs
Meng-Hao Guo
Xuanyu Chu
Qianrui Yang
Zhe-Han Mo
Yiqing Shen
...
Kiyohiro Nakayama
Zhengyang Geng
Houwen Peng
Han Hu
Shi-Min Hu
LRM
181
0
0
22 May 2025
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Benjamin Schneider
Dongfu Jiang
Chao Du
Tianyu Pang
Wenhu Chen
VLM
37
0
0
22 May 2025
ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations
Xuecheng Wu
Jiaxing Liu
Danlei Huang
Xiaoyu Li
Yifan Wang
Chen Chen
Liya Ma
Xuezhi Cao
Junxiao Xue
LRM
95
0
0
20 May 2025
PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models
He Zhu
Junyou Su
Minxin Chen
Wen Wang
Yijie Deng
Guanhua Chen
Wenjia Zhang
189
0
0
20 May 2025
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Wentao Ma
Weiming Ren
Yiming Jia
Zhuofeng Li
Ping Nie
Ge Zhang
Wenhu Chen
68
1
0
20 May 2025
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
Matteo Merler
Nicola Dainese
Minttu Alakuijala
Giovanni Bonetta
Pietro Ferrazzi
Yu Tian
Bernardo Magnini
Pekka Marttinen
LM&Ro
VLM
96
0
0
19 May 2025
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Maoyuan Ye
Jing Zhang
Juhua Liu
Bo Du
Dacheng Tao
LRM
160
0
0
18 May 2025
MedSG-Bench: A Benchmark for Medical Image Sequences Grounding
Jingkun Yue
Siqi Zhang
Zinan Jia
Huihuan Xu
Zongbo Han
Xiaohong Liu
Guangyu Wang
VLM
57
0
0
17 May 2025
CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction
Jing Zou
Qingqiu Li
Chenyu Lian
Lihao Liu
Xiaohan Yan
Shujun Wang
Jing Qin
VLM
150
0
0
17 May 2025
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
Yansheng Qiu
Li Xiao
Zhaopan Xu
Pengfei Zhou
Zheng Wang
Kai Zhang
ELM
LRM
98
0
0
16 May 2025
MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
Chonghan Liu
Haoran Wang
Felix Henry
Pu Miao
Yajie Zhang
Yu Zhao
Peiran Wu
VLM
147
0
0
15 May 2025
Bias and Generalizability of Foundation Models across Datasets in Breast Mammography
Elodie Germani
Selin Türk Ilayda
Zeineddine Fatima
Mourad Charbel
Shadi Albarqouni
AI4CE
87
0
0
14 May 2025
MELLM: Exploring LLM-Powered Micro-Expression Understanding Enhanced by Subtle Motion Perception
Zhengye Zhang
Sirui Zhao
Shifeng Liu
Shukang Yin
Xinglong Mao
Tong Xu
Enhong Chen
MLLM
88
0
0
11 May 2025
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
Yi-Fan Zhang
Xingyu Lu
X. Hu
Chaoyou Fu
Bin Wen
...
Jianfei Chen
Fan Yang
Zheng Zhang
Yan Li
Liang Wang
OffRL
LRM
108
6
0
05 May 2025
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
Zongxia Li
Xiyang Wu
Guangyao Shi
Yubin Qin
Hongyang Du
Tianyi Zhou
Dinesh Manocha
Jordan Lee Boyd-Graber
MLLM
96
0
0
02 May 2025
GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling
Siqi Li
Yufan Shen
Xiangnan Chen
Jiayi Chen
Hengwei Ju
...
Botian Shi
Y. Liu
Xinyu Cai
Yu Qiao
VLM
ELM
165
1
0
30 Apr 2025
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
Guanghao Zhou
Panjia Qiu
Chong Chen
Jiadong Wang
Zheming Yang
Jian Xu
Minghui Qiu
OffRL
LRM
172
8
0
30 Apr 2025
A Review of 3D Object Detection with Vision-Language Models
Ranjan Sapkota
Konstantinos I. Roumeliotis
Rahul Harsha Cheppally
Marco Flores Calero
Manoj Karkee
VLM
125
2
0
25 Apr 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Chris
Yichen Wei
Yi Peng
Xiang Wang
Weijie Qiu
...
Jianhao Zhang
Y. Hao
Xuchen Song
Yang Liu
Yahui Zhou
OffRL
AI4TS
SyDa
LRM
VLM
121
9
0
23 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
302
8
0
17 Apr 2025
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
Xiangxi Zheng
Linjie Li
Zhiyong Yang
Ping Yu
Alex Jinpeng Wang
Rui Yan
Yuan Yao
Lijuan Wang
LRM
61
1
0
08 Apr 2025
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
...
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf
VLM
90
16
0
07 Apr 2025
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Weiyun Wang
Zhangwei Gao
Lawrence Yunliang Chen
Zhe Chen
Jinguo Zhu
...
Lewei Lu
Haodong Duan
Yu Qiao
Jifeng Dai
Wenhai Wang
LRM
122
38
0
13 Mar 2025
Qwen2.5-VL Technical Report
S. Bai
Keqin Chen
Xuejing Liu
Jialin Wang
Wenbin Ge
...
Zesen Cheng
Hang Zhang
Zhibo Yang
Haiyang Xu
Junyang Lin
VLM
327
685
0
20 Feb 2025
SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain
Dakuan Lu
Jue Chen
Rui Xu
Tianchu Yao
Chao Qu
Wei Chu
Yinghui Xu
Yuan Qi
64
8
0
28 Jan 2025
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin
Yining Ye
Junjie Fang
Han Wang
Shihao Liang
...
Haifeng Liu
F. Lin
Tao Peng
Xin Liu
Guang Shi
LLMAG
LM&Ro
98
68
0
21 Jan 2025
MLVU: Benchmarking Multi-task Long Video Understanding
Yueze Wang
Yan Shu
Bo Zhao
Boya Wu
Junjie Zhou
...
Xi Yang
Y. Xiong
Bo Zhang
Tiejun Huang
Zheng Liu
VLM
112
11
0
03 Jan 2025
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang
Shusheng Yang
Anjali W. Gupta
Rilyn Han
Li Fei-Fei
Saining Xie
LRM
180
107
0
18 Dec 2024
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Guo Chen
Yicheng Liu
Yifei Huang
Yuping He
Baoqi Pei
Jilan Xu
Yali Wang
Tong Lu
Limin Wang
ELM
154
16
0
16 Dec 2024
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Xudong Lu
Yinghao Chen
Cheng Chen
Hui Tan
Boheng Chen
...
Aojun Zhou
Yafei Wen
Xiaoxin Chen
Shuai Ren
Hongsheng Li
37
9
0
16 Nov 2024