ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2404.16821
  4. Cited By
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
  Models with Open-Source Suites
v1v2 (latest)

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

25 April 2024
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
Erfei Cui
Wenwen Tong
Kongzhi Hu
Jiapeng Luo
Zheng Ma
Ji Ma
Jiaqi Wang
Xiao-wen Dong
Hang Yan
Hewei Guo
Conghui He
Botian Shi
Zhenjiang Jin
Chaochao Xu
Bin Wang
Xingjian Wei
Wei Li
Wenjian Zhang
Bo Zhang
Pinlong Cai
Licheng Wen
Xiangchao Yan
Min Dou
Lewei Lu
Xizhou Zhu
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
    MLLMVLM
ArXiv (abs)PDFHTMLGithub (8213★)

Papers citing "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites"

50 / 471 papers shown
Title
MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
Yang Tian
Zheng Lu
Mingqi Gao
Zheng Liu
Bo Zhao
LRM
84
0
0
21 Mar 2025
OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence
OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence
Long Yuan
Fengran Mo
Kaiyu Huang
Wenjie Wang
Wangyuxuan Zhai
Xiaoyu Zhu
You Li
Jinan Xu
Jian-Yun Nie
SyDa
136
1
0
20 Mar 2025
TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data
TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data
Rohit Kundu
Athula Balachandran
Amit K. Roy-Chowdhury
107
0
0
20 Mar 2025
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?
Xuanming Cui
Jaiminkumar Ashokbhai Bhoi
Chionh Wei Peng
Adriel Kuek
Ser-Nam Lim
104
0
0
20 Mar 2025
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
Chongjun Tu
Lin Zhang
Pengtao Chen
Peng Ye
Xianfang Zeng
Wei Cheng
Gang Yu
Tao Chen
161
3
0
19 Mar 2025
CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification
CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification
Wenlong Yu
Qilong Wang
Chuang Liu
Dong Li
Q. Hu
LRM
99
0
0
19 Mar 2025
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Jin Wang
Chenghui Lv
Xian Li
Shichao Dong
Huadong Li
Kelu Yao
Chao Li
Wenqi Shao
Ping Luo
157
1
0
19 Mar 2025
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
Tengjin Weng
Jingyi Wang
Wenhao Jiang
Zhong Ming
VLMLRM
84
0
0
19 Mar 2025
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Shravan Nayak
Xiangru Jian
Kevin Qinghong Lin
Juan A. Rodriguez
Montek Kalsi
...
David Vazquez
Christopher Pal
Perouz Taslakian
Spandana Gella
Sai Rajeswar
506
4
0
19 Mar 2025
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li
Jiajun Sun
Guodong Zheng
Xiaoran Fan
Yujiong Shen
...
Wenming Tan
Tao Ji
Tao Gui
Qi Zhang
Xuanjing Huang
AAMLVLM
195
1
0
19 Mar 2025
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding
Zining Wang
Tongkun Guan
Pei Fu
Chen Duan
Qianyi Jiang
Zhentao Guo
Shan Guo
Junfeng Luo
Wei Shen
Xiaokang Yang
MLLMVLM
85
3
0
18 Mar 2025
Growing a Twig to Accelerate Large Vision-Language Models
Growing a Twig to Accelerate Large Vision-Language Models
Zhenwei Shao
Mingyang Wang
Zhou Yu
Wenwen Pan
Yan Yang
Tao Wei
Hao Zhang
Ning Mao
Wei Chen
Jun Yu
VLM
92
2
0
18 Mar 2025
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models
Yiqi Zhu
Zihan Wang
Chen Zhang
Ziwei Sun
Yang Liu
CoGeVLM
118
1
0
18 Mar 2025
Aligning Multimodal LLM with Human Preference: A Survey
Aligning Multimodal LLM with Human Preference: A Survey
Tao Yu
Yize Zhang
Chaoyou Fu
Junkang Wu
Jinda Lu
...
Qingsong Wen
Zheng Zhang
Yan Huang
Liang Wang
Tieniu Tan
441
4
0
18 Mar 2025
MP-GUI: Modality Perception with MLLMs for GUI Understanding
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Ziwei Wang
Weizhi Chen
Leyang Yang
Sheng Zhou
Shengchu Zhao
Hanbei Zhan
Jiongchao Jin
Liangcheng Li
Zirui Shao
Jiajun Bu
128
5
0
18 Mar 2025
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
Yang Zhou
Shiyu Zhao
Yuxiao Chen
Zhenting Wang
Can Jin
Dimitris N. Metaxas
ObjD
154
0
0
18 Mar 2025
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Tao Wang
Changxu Cheng
Lingfeng Wang
Senda Chen
Wuyue Zhao
VLM
100
1
0
17 Mar 2025
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger
Nina Wenzel
David Griffiths
Haiming Gang
Justin Lazarow
...
Kai Kang
Marcin Eichner
Yue Yang
Afshin Dehghan
Peter Grasch
126
5
0
17 Mar 2025
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Sung-Yeon Park
Can Cui
Yunsheng Ma
Ahmadreza Moradipari
Rohit Gupta
Kyungtae Han
Ziran Wang
76
2
0
17 Mar 2025
ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models
ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models
Hao Yin
Guangzong Si
Zilei Wang
418
1
0
17 Mar 2025
ViSpeak: Visual Instruction Feedback in Streaming Videos
ViSpeak: Visual Instruction Feedback in Streaming Videos
Shenghao Fu
Q. Yang
Yuan-Ming Li
Yi-Xing Peng
Kun-Yu Lin
Xihan Wei
Jian-Fang Hu
Xiaohua Xie
Wei-Shi Zheng
VLM
145
1
0
17 Mar 2025
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Hao Yin
Guangzong Si
Zilei Wang
132
1
0
17 Mar 2025
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Jingyi Zhang
Jiaxing Huang
Huanjin Yao
Shunyu Liu
Xikun Zhang
Shijian Lu
Dacheng Tao
LRM
147
73
0
17 Mar 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Yang Liu
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&RoLRM
389
6
0
17 Mar 2025
MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
Yingyue Li
Bencheng Liao
Wenyu Liu
Xinggang Wang
Mamba
126
0
0
17 Mar 2025
VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
Yunze Liu
Peiran Wu
C. Liang
Junxiao Shen
Limin Wang
Li Yi
Mamba
161
1
0
16 Mar 2025
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Kanzhi Cheng
Wenpo Song
Jiaxin Fan
Zheng Ma
Qiushi Sun
Fangzhi Xu
Chenyang Yan
Nuo Chen
Jianbing Zhang
Jiajun Chen
MLLMVLM
97
3
0
16 Mar 2025
Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Junjie Chen
Xuyang Liu
Subin Huang
Linfeng Zhang
Hang Yu
103
0
0
15 Mar 2025
TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
Jonas Belouadi
Eddy Ilg
Margret Keuper
Hideki Tanaka
Masao Utiyama
Raj Dabre
Steffen Eger
Simone Paolo Ponzetto
205
0
0
14 Mar 2025
CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model
Yuxuan Luo
Jiaqi Tang
Chenyi Huang
Feiyang Hao
Zhouhui Lian
VLM
111
0
0
13 Mar 2025
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Weiyun Wang
Zhangwei Gao
Lawrence Yunliang Chen
Zhe Chen
Jinguo Zhu
...
Lewei Lu
Haodong Duan
Yu Qiao
Jifeng Dai
Wenhai Wang
LRM
150
39
0
13 Mar 2025
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning
Pengfei Luo
Jingbo Zhou
Tong Xu
Yuan Xia
Linli Xu
Enhong Chen
LRM
151
0
0
13 Mar 2025
VisTW: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan
VisTW: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan
Zhi Rui Tam
Ya-Ting Pai
Yen-Wei Lee
Yun-Nung Chen
CoGe
173
0
0
13 Mar 2025
Can LLMs Understand Time Series Anomalies?
Can LLMs Understand Time Series Anomalies?
Zihao Zhou
Rose Yu
AI4TS
169
15
0
13 Mar 2025
Towards Fast, Memory-based and Data-Efficient Vision-Language Policy
Haoxuan Li
Sixu Yan
Yongqian Li
Xinggang Wang
LM&Ro
128
1
0
13 Mar 2025
SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
Ziyu Guo
Ray Zhang
Hao Chen
Jialin Gao
Dongzhi Jiang
Jiaze Wang
Pheng-Ann Heng
117
7
0
13 Mar 2025
Memory-enhanced Retrieval Augmentation for Long Video Understanding
Memory-enhanced Retrieval Augmentation for Long Video Understanding
Huaying Yuan
Zhengyang Liang
Minhao Qin
Hongjin Qian
Yan Shu
Zhicheng Dou
Ji-Rong Wen
N. Sebe
VOSRALMVLM
119
5
0
12 Mar 2025
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
Katrin Renz
Long Chen
Elahe Arani
Oleg Sinavski
MLLM
215
6
0
12 Mar 2025
Generative Frame Sampler for Long Video Understanding
Linli Yao
Haoning Wu
Kun Ouyang
Yize Zhang
Caiming Xiong
Bei Chen
Xu Sun
Junnan Li
VLMVGen
94
1
0
12 Mar 2025
Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
Bozhi Luan
Wengang Zhou
Hao Feng
Zhe Wang
Xiaosong Li
Haoyang Li
VLM
131
0
0
11 Mar 2025
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Letian Zhang
Quan Cui
Bingchen Zhao
Cheng Yang
MLLMSyDa
117
1
0
11 Mar 2025
RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
Xichen Tan
Yunfan Ye
Yuanjing Luo
Qian Wan
Fang Liu
Zhiping Cai
VLM
125
1
0
11 Mar 2025
ComicsPAP: understanding comic strips by picking the correct panel
ComicsPAP: understanding comic strips by picking the correct panel
Emanuele Vivoli
Artemis LLabres
Mohamed Ali Soubgui
Marco Bertini
Ernest Valveny Llobet
Dimosthenis Karatzas
157
0
0
11 Mar 2025
EgoBlind: Towards Egocentric Visual Assistance for the Blind
EgoBlind: Towards Egocentric Visual Assistance for the Blind
Junbin Xiao
Nanxin Huang
Hao Qiu
Zhulin Tao
Xun Yang
Richang Hong
Ming Wang
Angela Yao
EgoVVLM
132
0
0
11 Mar 2025
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
Qinghao Ye
Xianhan Zeng
Fu Li
Chong Li
Haoqi Fan
CoGe
116
5
0
10 Mar 2025
Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words
Hongyu Su
Yifeng Gao
Yifan Ding
Jie Zhang
105
0
0
10 Mar 2025
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo
Yingying Zhang
Xiaoyu Yang
Kang Wu
Qi Zhu
Lei Liang
Jingdong Chen
Yansheng Li
156
2
0
10 Mar 2025
Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
Enming Zhang
Peizhe Gong
Xingyuan Dai
Yisheng Lv
Qinghai Miao
MLLMELM
109
2
0
09 Mar 2025
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
Runze Zhang
Guoguang Du
Xiaochuan Li
Qi Jia
Liang Jin
...
Zhenhua Guo
Yaqian Zhao
Xiaoli Gong
Rengang Li
Baoyu Fan
VGen
130
2
0
08 Mar 2025
GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
Xudong Lu
Yinghao Chen
Renshou Wu
Haohao Gao
Xi Chen
...
Fangyuan Li
Yafei Wen
Xiaoxin Chen
Shuai Ren
Hongsheng Li
165
0
0
08 Mar 2025
Previous
12345...8910
Next