2404.16821
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
25 April 2024
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
Erfei Cui
Wenwen Tong
Kongzhi Hu
Jiapeng Luo
Zheng Ma
Ji Ma
Jiaqi Wang
Xiao-wen Dong
Hang Yan
Hewei Guo
Conghui He
Botian Shi
Zhenjiang Jin
Chaochao Xu
Bin Wang
Xingjian Wei
Wei Li
Wenjian Zhang
Bo Zhang
Pinlong Cai
Licheng Wen
Xiangchao Yan
Min Dou
Lewei Lu
Xizhou Zhu
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM
VLM
Papers citing
"How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites"
50 / 471 papers shown
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
Teng Li
Quanfeng Lu
Lirui Zhao
Hao Li
X. Zhu
Yu Qiao
Jun Zhang
Wenqi Shao
22
0
0
20 Jun 2025
How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
Manuel Brack
Sudeep Katakol
Felix Friedrich
P. Schramowski
Hareesh Ravi
Kristian Kersting
Ajinkya Kale
32
0
0
20 Jun 2025
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM
VLM
21
0
0
20 Jun 2025
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
Haoran Sun
Yankai Jiang
Wenjie Lou
Yujie Zhang
Wenjie Li
Lilong Wang
Mianxin Liu
Lei Liu
Xiaosong Wang
LRM
18
0
0
20 Jun 2025
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yong Man Ro
Yu-Chun Wang
Yueh-Hua Wu
VLM
46
0
0
18 Jun 2025
Context-Informed Grounding Supervision
Hyunji Lee
Seunghyun Yoon
Yunjae Won
Hanseok Oh
Geewook Kim
Trung H. Bui
Franck Dernoncourt
Elias Stengel-Eskin
Mohit Bansal
Minjoon Seo
LRM
41
0
0
18 Jun 2025
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Chengye Wang
Yifei Shen
Zexi Kuang
Arman Cohan
Yilun Zhao
32
0
0
18 Jun 2025
PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
Yizhen Zhang
Yang Ding
Shuoshuo Zhang
Xinchen Zhang
Haoling Li
...
Jie Wu
Lei Ji
Yelong Shen
Y. Yang
Yeyun Gong
OffRL
VLM
LRM
26
0
0
17 Jun 2025
Image Corruption-Inspired Membership Inference Attacks against Large Vision-Language Models
Zongyu Wu
Minhua Lin
Zhiwei Zhang
Fali Wang
Xianren Zhang
Xiang Zhang
Suhang Wang
38
0
0
14 Jun 2025
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Huaying Yuan
Zheng Liu
Junjie Zhou
Ji-Rong Wen
Zhicheng Dou
VLM
123
0
0
12 Jun 2025
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning
Yuting Li
Lai Wei
Kaipeng Zheng
Jingyuan Huang
Linghe Kong
Lichao Sun
Weiran Huang
AAML
LRM
VLM
82
0
0
11 Jun 2025
ARGUS: Hallucination and Omission Evaluation in Video-LLMs
Ruchit Rawal
Reza Shirkavand
Heng-Chiao Huang
Gowthami Somepalli
Tom Goldstein
42
0
0
09 Jun 2025
Synthetic Visual Genome
J. S. Park
Zixian Ma
Linjie Li
Chenhao Zheng
Cheng-Yu Hsieh
...
Quan Kong
Norimasa Kobori
Ali Farhadi
Yejin Choi
Ranjay Krishna
23
0
0
09 Jun 2025
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
Lidong Lu
Guo Chen
Z. Li
Yicheng Liu
Tong Lu
VLM
LRM
107
0
0
05 Jun 2025
VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Hao Yan
Handong Zheng
Hao Wang
Liang Yin
Xingchen Liu
...
Minghui Liao
Chao Weng
Wei Chen
Yuliang Liu
Xiang Bai
LRM
54
0
0
03 Jun 2025
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Hyojin Bahng
Caroline Chan
F. Durand
Phillip Isola
EGVM
36
0
0
02 Jun 2025
IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection
Wayne Zhang
Changjiang Jiang
Zhonghao Zhang
Chenyang Si
Fengchang Yu
Wei Peng
49
0
0
01 Jun 2025
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
Lei Lei
Jie Gu
Xiaokang Ma
Chu Tang
Jingmin Chen
Tong Xu
48
1
0
01 Jun 2025
Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs
Yunqi Hong
Sohyun An
Andrew Bai
Neil Y. C. Lin
Cho-Jui Hsieh
VLM
41
0
0
01 Jun 2025
anyECG-chat: A Generalist ECG-MLLM for Flexible ECG Input and Multi-Task Understanding
Haitao Li
Ziyu Li
Yiheng Mao
Ziyi Liu
Zhoujian Sun
Zhengxing Huang
39
0
0
01 Jun 2025
GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs
X. Zhu
Ziheng Jia
Jiarui Wang
Xiangyu Zhao
Haodong Duan
Xiongkuo Min
Jia Wang
Zicheng Zhang
Guangtao Zhai
EGVM
VLM
51
0
0
01 Jun 2025
DisTime: Distribution-based Time Representation for Video Large Language Models
Yingsen Zeng
Zepeng Huang
Yujie Zhong
Chengjian Feng
Jie Hu
Lin Ma
Yang Liu
VGen
27
0
0
30 May 2025
Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts
Xin He
Xumeng Han
Longhui Wei
Lingxi Xie
Qi Tian
MoE
47
0
0
30 May 2025
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Ujjwal Upadhyay
Mukul Ranjan
Zhiqiang Shen
Mohamed Elhoseiny
VLM
28
0
0
30 May 2025
When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways
Kailin Jiang
Yuntao Du
Yukai Ding
Yuchen Ren
Ning Jiang
Zhi Gao
Zilong Zheng
Lei Liu
Bin Li
Qing Li
KELM
51
0
0
30 May 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Gen Luo
Ganlin Yang
Ziyang Gong
Guanzhou Chen
Haonan Duan
...
Wenhai Wang
Jifeng Dai
Yu Qiao
Rongrong Ji
X. Zhu
LM&Ro
39
1
0
30 May 2025
MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs
Zhiwei Liu
Lingfei Qian
Qianqian Xie
J. Huang
Kailai Yang
Sophia Ananiadou
22
0
0
30 May 2025
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
Yunze Man
De-An Huang
Guilin Liu
Shiwei Sheng
Shilong Liu
Liang-Yan Gui
Jan Kautz
Yu Wang
Zhiding Yu
MLLM
LRM
76
0
0
29 May 2025
DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
Sungjune Park
Hyunjun Kim
Junho Kim
S. T. Kim
Y. Ro
LRM
123
0
0
29 May 2025
VidText: Towards Comprehensive Evaluation for Video Text Understanding
Zhoufaran Yang
Yan Shu
Zhifei Yang
Yan Zhang
Yu-Hong Li
K. Lu
Gangyan Zeng
Shaohui Liu
Yu Zhou
N. Sebe
CoGe
70
0
0
28 May 2025
SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning
Jiaqi Huang
Zunnan Xu
Jun Zhou
Ting Liu
Yicheng Xiao
Mingwen Ou
Bowen Ji
Xiu Li
Kehong Yuan
VLM
93
0
0
28 May 2025
ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Bozhou Li
Wentao Zhang
VLM
42
0
0
27 May 2025
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs
Xuanwen Ding
Chengjun Pan
Zejun Li
Jiwen Zhang
Siyuan Wang
Zhongyu Wei
67
0
0
27 May 2025
Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning
Ruolin Shen
Xiaozhong Ji
Kai WU
Jiangning Zhang
Yijun He
HaiHua Yang
Xiaobin Hu
Xiaoyu Sun
82
0
0
26 May 2025
Two Causally Related Needles in a Video Haystack
Miaoyu Li
Qin Chao
Boyang Albert Li
CML
61
0
0
26 May 2025
Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
Matthew Lisondra
B. Benhabib
G. Nejat
LM&Ro
80
0
0
26 May 2025
Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Hyunsik Chae
Seungwoo Yoon
J. Park
Chloe Yewon Chun
Yongin Cho
Mu Cai
Yong Jae Lee
Ernest K. Ryu
CoGe
VLM
58
3
0
26 May 2025
Large Language Models for Planning: A Comprehensive and Systematic Survey
Pengfei Cao
Tianyi Men
Wencan Liu
Jingwen Zhang
Xuzhao Li
Xixun Lin
Dianbo Sui
Yanan Cao
Kang Liu
Jun Zhao
LLMAG
LM&Ro
OffRL
ELM
LRM
131
0
0
26 May 2025
Efficient Multi-modal Long Context Learning for Training-free Adaptation
Zehong Ma
Shiliang Zhang
Longhui Wei
Qi Tian
VLM
53
0
0
26 May 2025
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
Jin Wang
Yao Lai
Aoxue Li
Shifeng Zhang
Jiacheng Sun
Ning Kang
Chengyue Wu
Zhenguo Li
Ping Luo
74
2
0
26 May 2025
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
Fanheng Kong
Jingyuan Zhang
Hongzhi Zhang
Shi Feng
Daling Wang
Linhao Yu
Xingguang Ji
Yu Tian
Qi Wang
Fuzheng Zhang
62
1
0
26 May 2025
USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models
Baolin Zheng
Guanlin Chen
Hongqiong Zhong
Qingyang Teng
Yingshui Tan
...
Jincheng Wei
Wenbo Su
Xiaoyong Zhu
Bo Zheng
Kaifu Zhang
ELM
29
0
0
26 May 2025
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
Yongheng Zhang
Xu Liu
Ruoxi Zhou
Qiguang Chen
Hao Fei
Wenpeng Lu
L. Qin
HILM
LRM
33
0
0
25 May 2025
Doc-CoB: Enhancing Multi-Modal Document Understanding with Visual Chain-of-Boxes Reasoning
Ye Mo
Zirui Shao
Kai Ye
Xianwei Mao
Bo Zhang
...
Gang Huang
Kehan Chen
Zhou Huan
Zixu Yan
Sheng Zhou
LRM
62
0
0
24 May 2025
ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models
Duo Li
Zuhao Yang
Shijian Lu
VLM
98
0
0
24 May 2025
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
Runpeng Yu
Xinyin Ma
Xinchao Wang
MLLM
117
2
0
22 May 2025
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
Shilin Yan
Jiaming Han
Joey Tsai
Hongwei Xue
Rongyao Fang
Lingyi Hong
Ziyu Guo
Ray Zhang
VLM
95
4
0
22 May 2025
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Runsen Xu
Weiyao Wang
Hao Tang
Xingyu Chen
Xiaodong Wang
Fu-Jen Chu
Dahua Lin
Matt Feiszli
Kevin J. Liang
LRM
115
1
0
22 May 2025
NovelSeek: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
NovelSeek Team
Bo Zhang
Shiyang Feng
Xiangchao Yan
Jiakang Yuan
...
Zhongying Tu
Xiangyu Yue
W. Ouyang
Bowen Zhou
Lei Bai
LLMAG
110
2
0
22 May 2025
TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
Zeqing Wang
Shiyuan Zhang
Chengpei Tang
Keze Wang
LRM
81
0
0
21 May 2025