Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2312.14238
Cited By
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
21 December 2023
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
Sen Xing
Zhong Muyan
Qinglong Zhang
Xizhou Zhu
Lewei Lu
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks"
50 / 224 papers shown
Title
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving
Zongchuang Zhao
Haoyu Fu
Dingkang Liang
Xin Zhou
Dingyuan Zhang
Hongwei Xie
Bing Wang
Xiang Bai
MLLM
VLM
44
0
0
13 May 2025
SITE: towards Spatial Intelligence Thorough Evaluation
W. Wang
Reuben Tan
Pengyue Zhu
Jianwei Yang
Zhengyuan Yang
Lijuan Wang
Andrey Kolobov
Jianfeng Gao
Boqing Gong
45
0
0
08 May 2025
VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models
F. Khan
Jun Chen
Youssef Mohamed
Chun-Mei Feng
Mohamed Elhoseiny
VLM
33
0
0
08 May 2025
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang
Bo Feng
Zhengfeng Lai
Mingze Xu
Shiyu Li
Weifeng Ge
Afshin Dehghan
Meng Cao
Ping-Chia Huang
OffRL
49
0
0
08 May 2025
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie
Bin Wang
Fanjing Kong
Jincheng Li
Dawei Liang
Gengshen Zhang
Dawei Leng
Yuhui Yin
CLIP
VLM
44
0
0
08 May 2025
R^3-VQA: "Read the Room" by Video Social Reasoning
Lixing Niu
Jiapeng Li
Xingping Yu
Shu Wang
Ruining Feng
Bo Wu
Ping Wei
Y. Wang
Lifeng Fan
45
0
0
07 May 2025
RoboOS: A Hierarchical Embodied Framework for Cross-Embodiment and Multi-Agent Collaboration
Huajie Tan
Xiaoshuai Hao
Minglan Lin
Pengwei Wang
Yaoxu Lyu
Mingyu Cao
Zhongyuan Wang
S. Zhang
LM&Ro
46
0
0
06 May 2025
Mitigating Image Captioning Hallucinations in Vision-Language Models
Fei Zhao
C. Zhang
Runlin Zhang
Tianyang Wang
Xi Li
VLM
39
0
0
06 May 2025
TeDA: Boosting Vision-Lanuage Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment
Z. Wang
Yang Zhou
Jinhai Xiang
Y. Wang
Xinwei He
VLM
42
0
0
05 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
X. Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
74
0
0
05 May 2025
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
Yi-Fan Zhang
Xingyu Lu
X. Hu
Chaoyou Fu
Bin Wen
...
J. Chen
Fan Yang
Z. Zhang
Tingting Gao
Liang Wang
OffRL
LRM
41
0
0
05 May 2025
SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning
Jinpeng Chen
Runmin Cong
Yuzhi Zhao
Hongzheng Yang
Guangneng Hu
H. Ip
Sam Kwong
CLL
KELM
76
0
0
05 May 2025
An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding
Siyang Jiang
Bufang Yang
Lilin Xu
Mu Yuan
Yeerzhati Abudunuer
...
Liekang Zeng
Hongkai Chen
Zhenyu Yan
Xiaofan Jiang
Guoliang Xing
VLM
86
0
0
03 May 2025
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos
Zongxia Li
Xiyang Wu
Yubin Qin
Guangyao Shi
Hongyang Du
Dinesh Manocha
Tianyi Zhou
Jordan Boyd-Graber
MLLM
46
0
0
02 May 2025
Rethinking Visual Layer Selection in Multimodal LLMs
H. Chen
Junyan Lin
Xinhao Chen
Yue Fan
Xin Jin
Hui Su
Jianfeng Dong
Jinlan Fu
Xiaoyu Shen
VLM
95
0
0
30 Apr 2025
UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation
Linshan Wu
Yuxiang Nie
Sunan He
Jiaxin Zhuang
Hao Chen
LM&MA
MedIm
73
0
0
30 Apr 2025
AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images
Yunhao Li
Sijing Wu
Wei Sun
Zhichao Zhang
Yucheng Zhu
Zicheng Zhang
Huiyu Duan
Xiongkuo Min
Guangtao Zhai
EGVM
88
0
0
30 Apr 2025
Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering
Yumeng Shi
Quanyu Long
Wenya Wang
64
0
0
30 Apr 2025
CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
Jianyu Wu
Yizhou Wang
Xiangyu Yue
Xinzhu Ma
J. Guo
Dongzhan Zhou
Wanli Ouyang
Shixiang Tang
66
0
0
29 Apr 2025
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception
Yuanchen Wu
Lu Zhang
Hang Yao
Junlong Du
Ke Yan
Shouhong Ding
Yunsheng Wu
X. Li
MLLM
71
0
0
29 Apr 2025
X-Fusion: Introducing New Modality to Frozen Large Language Models
Sicheng Mo
Thao Nguyen
Xun Huang
Siddharth Srinivasan Iyer
Yijun Li
...
Eli Shechtman
Krishna Kumar Singh
Yong Jae Lee
Bolei Zhou
Yuheng Li
77
0
0
29 Apr 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
84
0
0
29 Apr 2025
YoChameleon: Personalized Vision and Language Generation
Thao Nguyen
Krishna Kumar Singh
Jing Shi
Trung H. Bui
Yong Jae Lee
Yuheng Li
MLLM
82
0
0
29 Apr 2025
SynergyAmodal: Deocclude Anything with Text Control
Xinyang Li
Chengjie Yi
Jiawei Lai
Mingbao Lin
Yansong Qu
Shengchuan Zhang
Liujuan Cao
DiffM
73
0
0
28 Apr 2025
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Yusen Zhang
Wenliang Zheng
Aashrith Madasu
Peng Shi
Ryo Kamoi
...
Ranran Haoran Zhang
Avitej Iyer
Renze Lou
Wenpeng Yin
Rui Zhang
68
0
0
25 Apr 2025
Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation
Peiyuan Jing
Kinhei Lee
Zhenxuan Zhang
Huichi Zhou
Zhengqing Yuan
Zhifan Gao
Lei Zhu
G. Papanastasiou
Yingying Fang
Guang Yang
MedIm
OffRL
LRM
63
0
0
25 Apr 2025
CAMU: Context Augmentation for Meme Understanding
Girish A. Koushik
Diptesh Kanojia
Helen Treharne
Aditya Joshi
VLM
96
0
0
24 Apr 2025
Circinus: Efficient Query Planner for Compound ML Serving
Banruo Liu
Wei-Yu Lin
Minghao Fang
Yihan Jiang
Fan Lai
LRM
34
0
0
23 Apr 2025
V
2
^2
2
R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations
Zhiyuan Fan
Yumeng Wang
Sandeep Polisetty
Yi Ren Fung
50
0
0
23 Apr 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Chris
Yichen Wei
Yi Peng
X. Wang
Weijie Qiu
...
Jianhao Zhang
Y. Hao
Xuchen Song
Yang Liu
Yahui Zhou
OffRL
AI4TS
SyDa
LRM
VLM
74
0
0
23 Apr 2025
AdaViP: Aligning Multi-modal LLMs via Adaptive Vision-enhanced Preference Optimization
Jinda Lu
Jinghan Li
Yuan Gao
Junkang Wu
Jiancan Wu
X. Wang
Xiangnan He
103
0
0
22 Apr 2025
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Weiye Xu
J. Wang
Weiyun Wang
Zhe Chen
Wengang Zhou
...
Xiaohua Wang
Xizhou Zhu
Wenhai Wang
Jifeng Dai
Jinguo Zhu
VLM
LRM
53
0
0
21 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
59
0
0
20 Apr 2025
Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
Yikun Ji
Y. Hong
Jiahui Zhan
H. Chen
Jun Lan
Huijia Zhu
Weiqiang Wang
L. Zhang
Jianfu Zhang
MLLM
LRM
46
0
0
19 Apr 2025
Manipulating Multimodal Agents via Cross-Modal Prompt Injection
Le Wang
Zonghao Ying
Tianyuan Zhang
Siyuan Liang
Shengshan Hu
Mingchuan Zhang
A. Liu
Xianglong Liu
AAML
31
1
0
19 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
103
0
0
17 Apr 2025
GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning
Liangyu Xu
Yingxiu Zhao
J. Wang
Yingyao Wang
Bu Pi
...
Jihao Gu
X. Li
Xiaoyong Zhu
Jun Song
Bo Zheng
LRM
145
1
0
17 Apr 2025
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Xinyi Liu
Xiaoyi Zhang
Ziyun Zhang
Yan Lu
34
0
0
15 Apr 2025
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild
Henghui Ding
Chang Liu
Nikhila Ravi
Shuting He
Y. Wei
...
Haobo Yuan
X. Li
Tao Zhang
Lu Qi
Ming Yang
28
0
0
15 Apr 2025
Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes
Huijie Liu
Bingcan Wang
Jie Hu
Xiaoming Wei
Guoliang Kang
63
0
0
14 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Z. Liu
Shenglong Ye
...
D. Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
W. Wang
MLLM
VLM
68
7
1
14 Apr 2025
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination
Hao Yin
Gunagzong Si
Zilei Wang
125
0
0
14 Apr 2025
Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation
Xiaoxing Hu
Ziyang Gong
Y. Wang
Yuru Jia
Gen Luo
Xue Yang
118
0
0
08 Apr 2025
Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision
Yuandong Pu
Le Zhuo
Kaiwen Zhu
Liangbin Xie
Wenlong Zhang
Xiangyu Chen
Peng Gao
Yu Qiao
Chao Dong
Yihao Liu
MLLM
61
1
0
07 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
37
0
0
07 Apr 2025
MedM-VL: What Makes a Good Medical LVLM?
Yiming Shi
Shaoshuai Yang
Xun Zhu
Haoyu Wang
Miao Li
Ji Wu
VLM
40
1
0
06 Apr 2025
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Zhenyi Liao
Qingsong Xie
Yanhao Zhang
Zijian Kong
Haonan Lu
Zhenyu Yang
Zhijie Deng
ReLM
VLM
LRM
101
0
1
01 Apr 2025
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Y. Li
Y. Zhang
Tao Lin
Xiangrui Liu
Wenxiao Cai
Zheng Liu
Bo Zhao
LRM
56
0
0
31 Mar 2025
Towards Understanding How Knowledge Evolves in Large Vision-Language Models
Sudong Wang
Y. Zhang
Yao Zhu
Jianing Li
Zizhe Wang
Y. Liu
Xiangyang Ji
120
0
0
31 Mar 2025
VideoGen-Eval: Agent-based System for Video Generation Evaluation
Yuhang Yang
Ke Fan
S.
Hongxiang Li
Ailing Zeng
FeiLin Han
Wei-dong Zhai
W. Liu
Yang Cao
Zheng-jun Zha
EGVM
VGen
73
0
0
30 Mar 2025
1
2
3
4
5
Next