2502.13923
Qwen2.5-VL Technical Report
20 February 2025
S. Bai
Keqin Chen
Xuejing Liu
Jialin Wang
Wenbin Ge
Sibo Song
K. Dang
P. Wang
S. Wang
J. Tang
Humen Zhong
Yuanzhi Zhu
Mingkun Yang
Zhaohai Li
Jianqiang Wan
P. Wang
Wei Ding
Zheren Fu
Yiheng Xu
Jiabo Ye
Xi Zhang
Tianbao Xie
Zesen Cheng
Hang Zhang
Zhibo Yang
Haiyang Xu
Junyang Lin
VLM
Papers citing "Qwen2.5-VL Technical Report"
50 / 210 papers shown
Title
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
Chia-Yu Hung
Qi Sun
Pengfei Hong
Amir Zadeh
Chuan Li
U-Xuan Tan
Navonil Majumder
Soujanya Poria
LM&Ro
42
1
0
28 Apr 2025
Fast-Slow Thinking for Large Vision-Language Model Reasoning
W. L. Xiao
Leilei Gan
Weilong Dai
Wanggui He
Ziwei Huang
...
Fangxun Shu
Zhelun Yu
Peng Zhang
Hao Jiang
Fei Wu
ReLM
LRM
AI4CE
164
1
0
25 Apr 2025
Revisiting Data Auditing in Large Vision-Language Models
Hongyu Zhu
Sichu Liang
Luu Anh Tuan
Boheng Li
Tongxin Yuan
Fangqi Li
Shilin Wang
Zhuosheng Zhang
VLM
185
0
0
25 Apr 2025
DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model
Zhanglin Wu
Tengfei Song
Ning Xie
W. Zhang
Pengfei Li
Shuang Wu
Chong Li
Junhao Zhu
Hao Yang
37
0
0
24 Apr 2025
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
Phillip Y. Lee
Jihyeon Je
Chanho Park
Mikaela Angelina Uy
Leonidas J. Guibas
Minhyuk Sung
LRM
46
0
0
24 Apr 2025
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
Ling You
Wenxuan Huang
Xinni Xie
Xiangyi Wei
Bangyan Li
Shaohui Lin
Yang Li
Changbo Wang
VGen
157
1
0
24 Apr 2025
Step1X-Edit: A Practical Framework for General Image Editing
S. Liu
Yucheng Han
Peng Xing
Fukun Yin
Rui Wang
...
Yibo Zhu
Binxing Jiao
Xuzhi Zhang
Gang Yu
Daxin Jiang
DiffM
108
3
0
24 Apr 2025
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
Zhikai Wang
Jiashuo Sun
Wenbo Zhang
Zhiqiang Hu
Xin Li
F. Wang
Deli Zhao
VLM
LRM
75
0
0
24 Apr 2025
BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation
Ruotong Wang
Mingli Zhu
Jiarong Ou
R. J. Chen
Xin Tao
Pengfei Wan
Baoyuan Wu
DiffM
AAML
VGen
53
0
0
23 Apr 2025
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Z. Wang
Senthil Purushwalkam
Caiming Xiong
Shri Kiran Srinivasan
Heng Ji
Ran Xu
38
0
0
23 Apr 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Chris
Yichen Wei
Yi Peng
Xuben Wang
Weijie Qiu
...
Jianhao Zhang
Y. Hao
Xuchen Song
Yang Liu
Yahui Zhou
OffRL
AI4TS
SyDa
LRM
VLM
79
0
0
23 Apr 2025
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Ziqi Pang
Yu-xiong Wang
VLM
45
0
0
22 Apr 2025
FaceInsight: A Multimodal Large Language Model for Face Perception
Jingzhi Li
Changjiang Luo
Ruoyu Chen
Hua Zhang
Wenqi Ren
Jianhou Gan
Xiaochun Cao
CVBM
LRM
65
0
0
22 Apr 2025
Vidi: Large Multimodal Models for Video Understanding and Editing
Vidi Team
Celong Liu
Chia-Wen Kuo
Dawei Du
Fan Chen
...
Wen Zhong
Xiaohui Shen
Xin Gu
Xing Mei
Xueqiong Qu
67
0
0
22 Apr 2025
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
Le Zhuo
Liangbing Zhao
Sayak Paul
Yue Liao
Renrui Zhang
Yi Xin
Peng Gao
Mohamed Elhoseiny
Hao Li
VLM
75
0
0
22 Apr 2025
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li
Huiqiang Jiang
Chengruidong Zhang
Qianhui Wu
Xufang Luo
...
Amir H. Abdi
Dongsheng Li
Jianfeng Gao
Yuqing Yang
Lili Qiu
33
1
0
22 Apr 2025
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
Chun-Hsiao Yeh
Chenyu Wang
Shengbang Tong
Ta-Ying Cheng
Rouyu Wang
Tianzhe Chu
Yuexiang Zhai
Yubei Chen
Shenghua Gao
Yi Ma
LRM
66
0
0
21 Apr 2025
Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform
Xianpan Zhou
VGen
58
0
0
21 Apr 2025
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
Ji Qi
Y. Yao
Yushi Bai
Bin Xu
Juanzi Li
Zhiyuan Liu
Tat-Seng Chua
38
0
0
21 Apr 2025
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
David Ma
Yuhang Zhang
J. Ren
Jarvis Guo
Yifan Yao
...
Shiwen Ni
Jing Liu
Wenhao Huang
Ge Zhang
Xiaojie Jin
VLM
40
0
0
21 Apr 2025
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Weiye Xu
Jun Wang
Weiyun Wang
Zhe Chen
Wengang Zhou
...
Xiaohua Wang
Xizhou Zhu
Wenhai Wang
Jifeng Dai
Jinguo Zhu
VLM
LRM
55
1
0
21 Apr 2025
Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension
Lin Li
Wei Chen
Jiahui Li
Lu Chen
LRM
45
1
0
20 Apr 2025
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding
Songtao Jiang
Yuan Wang
Sibo Song
Yuhang Zhang
Zijie Meng
Bohan Lei
Jian Wu
Jimeng Sun
Zuozhu Liu
MedIm
VLM
42
0
0
20 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
62
0
0
20 Apr 2025
Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis
Zichuan Liu
Liming Jiang
Qing Yan
Yumin Jia
Hao Kang
Xin Lu
DiffM
31
0
0
19 Apr 2025
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
Yuhang Liu
Pengxiang Li
C. Xie
Xavier Hu
Xiaotian Han
Shengyu Zhang
Hongxia Yang
Fei Wu
LLMAG
LM&Ro
LRM
AI4CE
72
2
0
19 Apr 2025
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue
Zhiqi Chen
Rui Lu
Andrew Zhao
Zhaokai Wang
Yang Yue
Shiji Song
Gao Huang
ReLM
LRM
58
13
0
18 Apr 2025
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
Yogesh Kulkarni
Pooyan Fazli
38
0
0
18 Apr 2025
Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning
Baining Zhao
Zhilin Wang
Jianjie Fang
Chen Gao
Fanhang Man
Jinqiang Cui
Xin Wang
Xinlei Chen
Y. Li
Wenwu Zhu
LM&Ro
VLM
LRM
66
1
0
17 Apr 2025
VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
Menglan Chen
Xianghe Pang
Jingjing Dong
Wenhao Wang
Yaxin Du
Siheng Chen
LRM
39
0
0
17 Apr 2025
SkyReels-V2: Infinite-length Film Generative Model
Guibin Chen
D. Lin
Jiangping Yang
Chunze Lin
J. Zhu
...
Di Qiu
Debang Li
Zhengcong Fei
Yang Li
Yahui Zhou
DiffM
VGen
56
1
0
17 Apr 2025
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Xiangyan Liu
Jinjie Ni
Zijian Wu
Chao Du
Longxu Dou
Haoran Wang
Tianyu Pang
Michael Shieh
OffRL
LRM
143
0
0
17 Apr 2025
LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection
Weijia Li
Guanglei Chu
Jiong Chen
Guo-Sen Xie
Caifeng Shan
Fang Zhao
LRM
37
1
0
17 Apr 2025
Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration
Yicheng Pan
Zhenrong Zhang
Pengfei Hu
Jiefeng Ma
Jun Du
Jianshu Zhang
Quan Liu
J. Gao
Feng Ma
LRM
38
0
0
17 Apr 2025
FocusedAD: Character-centric Movie Audio Description
Xiaojun Ye
C. Wang
Yiren Song
Sheng Zhou
Liangcheng Li
Jiajun Bu
VGen
55
0
0
16 Apr 2025
Instruction-augmented Multimodal Alignment for Image-Text and Element Matching
Xinli Yue
Jianhui Sun
Junda Lu
Liangchao Yao
Fan Xia
Tianyi Wang
Fengyun Rao
Jing Lyu
Yuetang Deng
25
0
0
16 Apr 2025
Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach
Lvpan Cai
Haowei Wang
Jiayi Ji
YanShu ZhouMen
Yiwei Ma
Xiaoshuai Sun
Liujuan Cao
Rongrong Ji
ViT
39
0
0
16 Apr 2025
AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection
Yuhao Chao
Jie Liu
J. Tang
Gangshan Wu
35
1
0
16 Apr 2025
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar
Ali Etemad
31
0
0
16 Apr 2025
FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos
Rui Chen
Lei Sun
Jing Tang
Geng Li
Xiangxiang Chu
LRM
29
0
0
14 Apr 2025
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Tao Zhang
X. Li
Zilong Huang
Y. Li
Weixian Lei
XueQing Deng
Shihao Chen
S. Ji
Jiashi Feng
MLLM
LRM
62
2
0
14 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Z. Liu
Shenglong Ye
...
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
Wei Wang
MLLM
VLM
70
15
1
14 Apr 2025
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Zheng Liu
Mengjie Liu
Jianfei Chen
Jingwei Xu
Bin Cui
Conghui He
Wentao Zhang
MLLM
59
0
0
14 Apr 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Yang Shi
Jiaheng Liu
Yushuo Guan
Zhikai Wu
Yuyao Zhang
...
Bohan Zeng
Wei Zhang
Fuzheng Zhang
Wenjing Yang
Di Zhang
VGen
VLM
73
0
0
14 Apr 2025
HistLLM: A Unified Framework for LLM-Based Multimodal Recommendation with User History Encoding and Compression
Chen Zhang
Bo Hu
Weidong Chen
Zhendong Mao
149
0
0
14 Apr 2025
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Run Luo
Lu Wang
Wanwei He
Xiaobo Xia
LLMAG
51
8
0
14 Apr 2025
GeoNav: Empowering MLLMs with Explicit Geospatial Reasoning Abilities for Language-Goal Aerial Navigation
Haotian Xu
Yue Hu
Chen Gao
Zhengqiu Zhu
Yong Zhao
Yongqian Li
Quanjun Yin
39
0
0
13 Apr 2025
BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning
Shengao Wang
Arjun Chandra
Aoming Liu
Venkatesh Saligrama
Boqing Gong
MLLM
VLM
47
0
0
13 Apr 2025
Towards Explainable Partial-AIGC Image Quality Assessment
Jiaying Qian
Ziheng Jia
Zicheng Zhang
Zeyu Zhang
Guangtao Zhai
Xiongkuo Min
40
0
0
12 Apr 2025
VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro
Zheyuan Zhang
Monica Dou
Linkai Peng
Hongyi Pan
Ulas Bagci
Boqing Gong
VLM
61
0
0
12 Apr 2025