2404.16821
Cited By
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
25 April 2024
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
Erfei Cui
Wenwen Tong
Kongzhi Hu
Jiapeng Luo
Zheng Ma
Ji Ma
Jiaqi Wang
Xiao-wen Dong
Hang Yan
Hewei Guo
Conghui He
Botian Shi
Zhenjiang Jin
Chaochao Xu
Bin Wang
Xingjian Wei
Wei Li
Wenjian Zhang
Bo Zhang
Pinlong Cai
Licheng Wen
Xiangchao Yan
Min Dou
Lewei Lu
Xizhou Zhu
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM
VLM
Github (8213★)
Papers citing
"How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites"
50 / 471 papers shown
Title
VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection
Huilin Deng
Hongchen Luo
Wei Zhai
Yang Cao
Yu Kang
79
2
0
30 Sep 2024
Visual Context Window Extension: A New Perspective for Long Video Understanding
Hongchen Wei
Zhenzhong Chen
VLM
88
6
0
30 Sep 2024
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
Zicheng Zhang
Ziheng Jia
H. Wu
Chunyi Li
Zijian Chen
...
Wei Sun
Xiaohong Liu
Xiongkuo Min
Weisi Lin
Guangtao Zhai
105
10
0
30 Sep 2024
Visual Question Decomposition on Multimodal Large Language Models
Haowei Zhang
Jianzhe Liu
Zhen Han
Shuo Chen
Bailan He
Volker Tresp
Zhiqiang Xu
Jindong Gu
157
2
0
28 Sep 2024
Emu3: Next-Token Prediction is All You Need
Xinlong Wang
Xiaosong Zhang
Zhengxiong Luo
Quan-Sen Sun
Yufeng Cui
...
Xi Yang
Jingjing Liu
Yonghua Lin
Tiejun Huang
Zhongyuan Wang
MLLM
116
233
0
27 Sep 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
248
52
0
26 Sep 2024
Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?
Bowen Zhao
Leo Parker Dirac
Paulina Varshavskaya
VLM
LRM
101
0
0
25 Sep 2024
AIM 2024 Challenge on UHD Blind Photo Quality Assessment
Vlad Hosu
Marcos V. Conde
Lorenzo Agnolucci
Nabajeet Barman
Saman Zadtootaghaj
Radu Timofte
66
8
0
24 Sep 2024
CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent State Representation
Fuxian Huang
Qi Zhang
Shaopeng Zhai
Jie Wang
Tianyi Zhang
Haoran Zhang
Ming Zhou
Yu Liu
Yu Qiao
CLIP
AI4TS
84
0
0
24 Sep 2024
VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models
Nam Hyeon-Woo
Moon Ye-Bin
Wonseok Choi
Lee Hyun
Tae-Hyun Oh
CoGe
68
3
0
23 Sep 2024
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
Junzhuo Liu
Xiaohu Yang
Weiwei Li
Peng Wang
ObjD
139
5
0
23 Sep 2024
Phantom of Latent for Large Language and Vision Models
Byung-Kwan Lee
Sangyun Chung
Chae Won Kim
Beomchan Park
Yong Man Ro
VLM
LRM
100
7
0
23 Sep 2024
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Kang Zhu
...
Zhaoxiang Zhang
Zachary Liu
Emmanouil Benetos
Wenhao Huang
Chenghua Lin
LRM
181
19
0
23 Sep 2024
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu
Peitian Zhang
Zheng Liu
Minghao Qin
Yueze Wang
Tiejun Huang
Bo Zhao
VLM
141
59
0
22 Sep 2024
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Dongzhi Jiang
Renrui Zhang
Ziyu Guo
Yanmin Wu
Jiayi Lei
...
Guanglu Song
Peng Gao
Yu Liu
Chunyuan Li
Hongsheng Li
MLLM
116
22
0
19 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe
VLM
173
2
0
19 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
148
2
0
19 Sep 2024
NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai
Nayeon Lee
Wei Ping
Zhuoling Yang
Zihan Liu
Jon Barker
Tuomas Rintamaki
Mohammad Shoeybi
Bryan Catanzaro
Ming-Yu Liu
MLLM
VLM
LRM
123
73
0
17 Sep 2024
OmniGen: Unified Image Generation
Shitao Xiao
Yueze Wang
Huaying Yuan
Xingrun Xing
Ruiran Yan
Shuting Wang
Tiejun Huang
Zheng Liu
DiffM
VLM
SyDa
136
88
0
17 Sep 2024
A Compressive Memory-based Retrieval Approach for Event Argument Extraction
Wanlong Liu
Enqi Zhang
Li Zhou
DingYi Zeng
Shaohuan Cheng
Chen Zhang
Malu Zhang
Wenyu Chen
RALM
90
0
0
14 Sep 2024
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha
Vinija Jain
Aman Chadha
72
3
0
14 Sep 2024
READoc: A Unified Benchmark for Realistic Document Structured Extraction
Zichao Li
Aizier Abulaiti
Yaojie Lu
Xuanang Chen
Jia Zheng
Hongyu Lin
Xianpei Han
Le Sun
75
5
0
08 Sep 2024
POINTS: Improving Your Vision-language Model with Affordable Strategies
Yuan Liu
Zhongyin Zhao
Ziyuan Zhuang
Le Tian
Xiao Zhou
Jie Zhou
VLM
99
9
0
07 Sep 2024
Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver
Zeren Zhang
Jo-Ku Cheng
Jingyang Deng
Lu Tian
Jinwen Ma
Ziran Qin
Xiaokai Zhang
Na Zhu
Tuo Leng
87
5
0
06 Sep 2024
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Haoran Wei
Chenglong Liu
Jinyue Chen
Jia Wang
Lingyu Kong
...
Liang Zhao
Jianjian Sun
Yuang Peng
Chunrui Han
Xiangyu Zhang
VLM
100
55
0
03 Sep 2024
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
Baichuan Zhou
Haote Yang
Dairong Chen
Junyan Ye
Tianyi Bai
Jinhua Yu
Songyang Zhang
Dahua Lin
Conghui He
Weijia Li
VLM
179
7
0
30 Aug 2024
Law of Vision Representation in MLLMs
Shijia Yang
Bohan Zhai
Quanzeng You
Jianbo Yuan
Hongxia Yang
Chenfeng Xu
157
12
0
29 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Zhiding Yu
Guilin Liu
MLLM
155
68
0
28 Aug 2024
Brain-inspired Artificial Intelligence: A Comprehensive Review
Jing Ren
Xiwei Xu
AI4CE
129
4
0
27 Aug 2024
NeuroLM: A Universal Multi-task Foundation Model for Bridging the Gap between Language and EEG Signals
Wei-Bang Jiang
Yansen Wang
Bao-Liang Lu
Dongsheng Li
139
15
0
27 Aug 2024
Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
Jiajun Fei
Dian Li
Zhidong Deng
Zekun Wang
Gang Liu
Hui Wang
VLM
85
43
0
26 Aug 2024
Knowledge-Aware Reasoning over Multimodal Semi-structured Tables
Suyash Vardhan Mathur
J. Bafna
Kunal Kartik
Harshita Khandelwal
Manish Shrivastava
Vivek Gupta
Joey Tianyi Zhou
Dan Roth
LMTD
113
2
0
25 Aug 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
138
78
0
22 Aug 2024
EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning
Bohao Xing
Zitong Yu
Xin Liu
Kaishen Yuan
Qilang Ye
Weicheng Xie
Huanjing Yue
Jingyu Yang
Heikki Kälviäinen
101
13
0
21 Aug 2024
Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model
Mengying Ge
Dongkai Tang
Mingyang Li
VLM
72
1
0
21 Aug 2024
MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
Yanbo Ding
Shaobin Zhuang
Kunchang Li
Zhengrong Yue
Yu Qiao
Yali Wang
VGen
86
2
0
20 Aug 2024
Med-PMC: Medical Personalized Multi-modal Consultation with a Proactive Ask-First-Observe-Next Paradigm
Hongcheng Liu
Yusheng Liao
Siqv Ou
Yuhao Wang
Heyang Liu
Yanfeng Wang
Yu Wang
LM&MA
58
3
0
16 Aug 2024
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Wenwen Zhuang
Xin Huang
Xiantao Zhang
Jin Zeng
LRM
123
31
0
16 Aug 2024
Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs
Jinming Liu
Yuntao Wei
Junyan Lin
Shengyang Zhao
Heming Sun
Zhibo Chen
Wenjun Zeng
Xin Jin
137
2
0
16 Aug 2024
Level Up Your Tutorials: VLMs for Game Tutorials Quality Assessment
Daniele Rege Cambrin
Gabriele Scaffidi Militone
Luca Colomba
Giovanni Malnati
D. Apiletti
Paolo Garza
94
1
0
15 Aug 2024
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao-Yang Liu
Tianjie Zhang
Yu Gu
Iat Long Iong
Yifan Xu
...
Zhengxiao Du
Chan Hee Song
Yu Su
Yuxiao Dong
Jie Tang
VLM
LLMAG
126
38
0
12 Aug 2024
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Chaoyou Fu
Haojia Lin
Zuwei Long
Yunhang Shen
Meng Zhao
...
Rongrong Ji
Xing Sun
Ran He
Caifeng Shan
Xing Sun
MLLM
140
96
0
09 Aug 2024
Arctic-TILT. Business Document Understanding at Sub-Billion Scale
Łukasz Borchmann
Michał Pietruszka
Wojciech Jaśkowski
Dawid Jurkiewicz
Piotr Halama
...
Gabriela Nowakowska
Artur Zawłocki
Łukasz Duhr
Paweł Dyda
Michał Turski
VLM
91
1
0
08 Aug 2024
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng
Jun Wang
Chuanhao Li
Quanfeng Lu
Hao Tian
...
Jifeng Dai
Ping Luo
Kaipeng Zhang
Wenqi Shao
VLM
100
26
0
05 Aug 2024
Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models
Mingxin Huang
Yuliang Liu
Dingkang Liang
Lianwen Jin
Xiang Bai
111
14
0
04 Aug 2024
Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models
Fushuo Huo
Wenchao Xu
Zhong Zhang
Yining Qi
Zhicheng Chen
Peilin Zhao
VLM
MLLM
212
31
0
04 Aug 2024
Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM
Can Wang
Hongliang Zhong
Menglei Chai
Mingming He
DongDong Chen
Jing Liao
LM&Ro
3DV
LRM
82
5
0
31 Jul 2024
Heads Up eXperience (HUX): Always-On AI Companion for Human Computer Environment Interaction
K. Sukanth
Sudhiksha Kandavel Rajan
S. Rajashekhar V.
Gowdham Prabhakar
37
1
0
28 Jul 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
114
29
0
22 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
75
2
0
22 Jul 2024