ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2401.06209
  4. Cited By
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

11 January 2024
Shengbang Tong
Zhuang Liu
Yuexiang Zhai
Yi-An Ma
Yann LeCun
Saining Xie
    VLM
    MLLM
ArXivPDFHTML

Papers citing "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs"

50 / 241 papers shown
Title
MMSearch: Benchmarking the Potential of Large Models as Multi-modal
  Search Engines
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Dongzhi Jiang
Renrui Zhang
Ziyu Guo
Yanmin Wu
Jiayi Lei
...
Guanglu Song
Peng Gao
Yu Liu
Chunyuan Li
Hongsheng Li
MLLM
29
16
0
19 Sep 2024
Fit and Prune: Fast and Training-free Visual Token Pruning for
  Multi-modal Large Language Models
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
Weihao Ye
Qiong Wu
Wenhao Lin
Yiyi Zhou
VLM
35
10
0
16 Sep 2024
EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
Yupeng Chen
Penglin Chen
Xiaoyu Zhang
Yixian Huang
Qian Xie
DiffM
46
1
0
15 Sep 2024
Anytime Continual Learning for Open Vocabulary Classification
Anytime Continual Learning for Open Vocabulary Classification
Zhen Zhu
Yiming Gong
Derek Hoiem
VLM
37
1
0
13 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
115
1
0
04 Sep 2024
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for
  Robotic Manipulation
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
Wenlong Huang
Chen Wang
Y. Li
Ruohan Zhang
Li Fei-Fei
46
87
0
03 Sep 2024
Blocks as Probes: Dissecting Categorization Ability of Large Multimodal
  Models
Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models
Bin Fu
Qiyang Wan
Jialin Li
Ruiping Wang
Xilin Chen
40
0
0
03 Sep 2024
Understanding Multimodal Hallucination with Parameter-Free
  Representation Alignment
Understanding Multimodal Hallucination with Parameter-Free Representation Alignment
Yueqian Wang
Jianxin Liang
Yuxuan Wang
Huishuai Zhang
Dongyan Zhao
41
1
0
02 Sep 2024
Law of Vision Representation in MLLMs
Law of Vision Representation in MLLMs
Shijia Yang
Bohan Zhai
Quanzeng You
Jianbo Yuan
Hongxia Yang
Chenfeng Xu
40
9
0
29 Aug 2024
A Survey on Evaluation of Multimodal Large Language Models
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA
ELM
LRM
50
20
0
28 Aug 2024
GlaLSTM: A Concurrent LSTM Stream Framework for Glaucoma Detection via Biomarker Mining
GlaLSTM: A Concurrent LSTM Stream Framework for Glaucoma Detection via Biomarker Mining
Cheng Huang
Weizheng Xie
Jian Zhou
Karanjit S Kooner
Karanjit Kooner
Yishen Liu
33
1
0
28 Aug 2024
Visual Agents as Fast and Slow Thinkers
Visual Agents as Fast and Slow Thinkers
Guangyan Sun
Mingyu Jin
Zhenting Wang
Cheng-Long Wang
Siqi Ma
Qifan Wang
Ying Nian Wu
Ying Nian Wu
Dongfang Liu
Dongfang Liu
LLMAG
LRM
77
13
0
16 Aug 2024
Response Wide Shut: Surprising Observations in Basic Vision Language
  Model Capabilities
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities
Shivam Chandhok
Wan-Cyuan Fan
Leonid Sigal
VLM
MLLM
23
3
0
13 Aug 2024
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language
  Models
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
Qirui Jiao
Daoyuan Chen
Yilun Huang
Yaliang Li
Ying Shen
VLM
32
5
0
08 Aug 2024
Targeted Visual Prompting for Medical Visual Question Answering
Targeted Visual Prompting for Medical Visual Question Answering
Sergio Tascon-Morales
Pablo Márquez-Neila
Raphael Sznitman
26
2
0
06 Aug 2024
Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models
  within Perturbed Inputs
Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs
Peng Ding
Jingyu Wu
M. Girolami
Dan Ma
Xuezhi Cao
Xunliang Cai
Shi Chen
T. J. Sullivan
Shujian Huang
AAML
VLM
MLLM
31
4
0
02 Aug 2024
CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language
  Models
CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models
Junda Wu
Xintong Li
Tong Yu
Yu-Xiang Wang
Xiang Chen
Jiuxiang Gu
Lina Yao
Jingbo Shang
Julian McAuley
41
0
0
29 Jul 2024
Diffusion Feedback Helps CLIP See Better
Diffusion Feedback Helps CLIP See Better
Wenxuan Wang
Quan-Sen Sun
Fan Zhang
Yepeng Tang
Jing Liu
Xinlong Wang
VLM
40
14
0
29 Jul 2024
Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's
  Impact on Spatio-Temporal Cross-Attentions
Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions
Ashkan Taghipour
Morteza Ghahremani
Bennamoun
Aref Miri Rekavandi
Zinuo Li
Hamid Laga
F. Boussaïd
VGen
73
2
0
27 Jul 2024
VACoDe: Visual Augmented Contrastive Decoding
VACoDe: Visual Augmented Contrastive Decoding
Sihyeon Kim
Boryeong Cho
Sangmin Bae
Sumyeong Ahn
SeYoung Yun
34
3
0
26 Jul 2024
Every Part Matters: Integrity Verification of Scientific Figures Based
  on Multimodal Large Language Models
Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models
Xiang Shi
Jiawei Liu
Yinpeng Liu
Qikai Cheng
Wei Lu
39
0
0
26 Jul 2024
Unified Lexical Representation for Interpretable Visual-Language
  Alignment
Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li
Yikai Wang
Yanwei Fu
Dongyu Ru
Zheng-Wei Zhang
Tong He
VLM
39
3
0
25 Jul 2024
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large
  Language Models
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
Leyang Shen
Gongwei Chen
Rui Shao
Weili Guan
Liqiang Nie
MoE
40
6
0
17 Jul 2024
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
  Instruction Using Language Model
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Wenqi Zhang
Zhenglin Cheng
Yuanyu He
Mengna Wang
Yongliang Shen
...
Guiyang Hou
Mingqian He
Yanna Ma
Weiming Lu
Yueting Zhuang
SyDa
66
9
0
09 Jul 2024
Smart Vision-Language Reasoners
Smart Vision-Language Reasoners
Denisa Roberts
Lucas Roberts
VLM
ReLM
LRM
46
4
0
05 Jul 2024
VSP: Assessing the dual challenges of perception and reasoning in
  spatial planning tasks for VLMs
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs
Qiucheng Wu
Handong Zhao
Michael Stephen Saxon
T. Bui
William Yang Wang
Yang Zhang
Shiyu Chang
CoGe
38
4
0
02 Jul 2024
Tell Me Where You Are: Multimodal LLMs Meet Place Recognition
Tell Me Where You Are: Multimodal LLMs Meet Place Recognition
Zonglin Lyu
Juexiao Zhang
Mingxuan Lu
Yiming Li
Chen Feng
38
4
0
25 Jun 2024
MM-SpuBench: Towards Better Understanding of Spurious Biases in
  Multimodal LLMs
MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs
Wenqian Ye
Guangtao Zheng
Yunsheng Ma
Xu Cao
Bolin Lai
James M. Rehg
Aidong Zhang
37
10
0
24 Jun 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong
Ellis L Brown
Penghao Wu
Sanghyun Woo
Manoj Middepogu
...
Xichen Pan
Austin Wang
Rob Fergus
Yann LeCun
Saining Xie
3DV
MLLM
48
279
0
24 Jun 2024
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark
  with Human-VLM Collaboration
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
Yujin Baek
chaeHun Park
Jaeseok Kim
Yu-Jung Heo
Du-Seong Chang
Jaegul Choo
23
3
0
24 Jun 2024
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in
  Large Video-Language Models
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Yuxuan Wang
Yueqian Wang
Dongyan Zhao
Cihang Xie
Zilong Zheng
MLLM
VLM
42
25
0
24 Jun 2024
Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach
Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach
Yuxuan Wan
Chaozheng Wang
Yi Dong
Wenxuan Wang
Shuqing Li
Yintong Huo
M. Lyu
3DV
73
10
0
24 Jun 2024
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Yuxuan Qiao
Haodong Duan
Xinyu Fang
Junming Yang
Lin Chen
Songyang Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LRM
37
18
0
20 Jun 2024
Look Further Ahead: Testing the Limits of GPT-4 in Path Planning
Look Further Ahead: Testing the Limits of GPT-4 in Path Planning
Mohamed Aghzal
E. Plaku
Ziyu Yao
ELM
36
6
0
17 Jun 2024
What is the Visual Cognition Gap between Humans and Multimodal LLMs?
What is the Visual Cognition Gap between Humans and Multimodal LLMs?
Xu Cao
Bolin Lai
Wenqian Ye
Yunsheng Ma
Joerg Heintz
Jintai Chen
Jianguo Cao
James M. Rehg
45
8
0
14 Jun 2024
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal
  Language Models
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu
Weijia Shi
Xingyu Fu
Dan Roth
Mari Ostendorf
Luke Zettlemoyer
Noah A. Smith
Ranjay Krishna
LRM
50
37
0
13 Jun 2024
Exploring the Spectrum of Visio-Linguistic Compositionality and
  Recognition
Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition
Youngtaek Oh
Pyunghwan Ahn
Jinhyung Kim
Gwangmo Song
Soonyoung Lee
In So Kweon
Junmo Kim
CoGe
42
2
0
13 Jun 2024
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases
Rithesh Murthy
Liangwei Yang
Juntao Tan
Tulika Awalgaonkar
Yilun Zhou
...
Zuxin Liu
Ming Zhu
Huan Wang
Caiming Xiong
Silvio Savarese
57
5
0
12 Jun 2024
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and
  Effective for LMMs
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
Lingchen Meng
Jianwei Yang
Rui Tian
Xiyang Dai
Zuxuan Wu
Jianfeng Gao
Yu-Gang Jiang
VLM
22
9
0
06 Jun 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in
  Large Multi-modal Models
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim
Hyunjun Kim
Yeonju Kim
Yong Man Ro
MLLM
47
10
0
04 Jun 2024
ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval
  from Linguistically Complex Descriptions
ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions
Honglin Lin
Siyu Li
Gu Nan
Chaoyue Tang
Xueting Wang
...
Yankai Rong
Zhili Zhou
Yutong Gao
Qimei Cui
Xiaofeng Tao
25
0
0
29 May 2024
Why are Visually-Grounded Language Models Bad at Image Classification?
Why are Visually-Grounded Language Models Bad at Image Classification?
Yuhui Zhang
Alyssa Unell
Xiaohan Wang
Dhruba Ghosh
Yuchang Su
Ludwig Schmidt
Serena Yeung-Levy
VLM
35
27
0
28 May 2024
Don't Miss the Forest for the Trees: Attentional Vision Calibration for
  Large Vision Language Models
Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
Sangmin Woo
Donguk Kim
Jaehyuk Jang
Yubin Choi
Changick Kim
42
12
0
28 May 2024
Dense Connector for MLLMs
Dense Connector for MLLMs
Huanjin Yao
Wenhao Wu
Taojiannan Yang
Yuxin Song
Mengxi Zhang
Haocheng Feng
Yifan Sun
Zhiheng Li
Wanli Ouyang
Jingdong Wang
MLLM
VLM
39
16
0
22 May 2024
TinyLLaVA Factory: A Modularized Codebase for Small-scale Large
  Multimodal Models
TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models
Junlong Jia
Ying Hu
Xi Weng
Yiming Shi
Miao Li
...
Baichuan Zhou
Ziyu Liu
Jie Luo
Lei Huang
Ji Wu
34
9
0
20 May 2024
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via
  Reinforcement Learning
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
Yuexiang Zhai
Hao Bai
Zipeng Lin
Jiayi Pan
Shengbang Tong
...
Alane Suhr
Saining Xie
Yann LeCun
Yi-An Ma
Sergey Levine
LLMAG
LRM
39
56
0
16 May 2024
Libra: Building Decoupled Vision System on Large Language Models
Libra: Building Decoupled Vision System on Large Language Models
Yifan Xu
Xiaoshan Yang
Y. Song
Changsheng Xu
MLLM
VLM
43
6
0
16 May 2024
Investigating the Semantic Robustness of CLIP-based Zero-Shot Anomaly
  Segmentation
Investigating the Semantic Robustness of CLIP-based Zero-Shot Anomaly Segmentation
Kevin Stangl
Marius Arvinte
Weilin Xu
Cory Cornelius
VLM
UQCV
34
0
0
13 May 2024
Transcrib3D: 3D Referring Expression Resolution through Large Language
  Models
Transcrib3D: 3D Referring Expression Resolution through Large Language Models
Jiading Fang
Xiangshan Tan
Shengjie Lin
Igor Vasiljevic
Vitor Campagnolo Guizilini
Hongyuan Mei
Rares Ambrus
Gregory Shakhnarovich
Matthew R. Walter
LM&Ro
33
4
0
30 Apr 2024
Hallucination of Multimodal Large Language Models: A Survey
Hallucination of Multimodal Large Language Models: A Survey
Zechen Bai
Pichao Wang
Tianjun Xiao
Tong He
Zongbo Han
Zheng Zhang
Mike Zheng Shou
VLM
LRM
95
139
0
29 Apr 2024
Previous
12345
Next