Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2401.06209
Cited By
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
11 January 2024
Shengbang Tong
Zhuang Liu
Yuexiang Zhai
Yi-An Ma
Yann LeCun
Saining Xie
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs"
50 / 241 papers shown
Title
Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis
Pengfei Wang
Guohai Xu
Weinong Wang
Junjie Yang
Jie Lou
Yunhua Xue
24
0
0
15 May 2025
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
26
0
0
11 May 2025
VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models
F. Khan
Jun Chen
Youssef Mohamed
Chun-Mei Feng
Mohamed Elhoseiny
VLM
33
0
0
08 May 2025
Looking Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models
Aarti Ghatkesar
Uddeshya Upadhyay
Ganesh Venkatesh
VLM
38
0
0
08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Y. Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
139
0
0
07 May 2025
VSC: Visual Search Compositional Text-to-Image Diffusion Model
Do Huu Dat
Nam Hyeonu
Po Yuan Mao
Tae-Hyun Oh
DiffM
CoGe
62
0
0
02 May 2025
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Reyhane Askari Hemmat
Koustuv Sinha
Melissa Hall
M. Drozdzal
Adriana Romero-Soriano
EGVM
60
0
0
01 May 2025
Rethinking Visual Layer Selection in Multimodal LLMs
H. Chen
Junyan Lin
Xinhao Chen
Yue Fan
Xin Jin
Hui Su
Jianfeng Dong
Jinlan Fu
Xiaoyu Shen
VLM
95
0
0
30 Apr 2025
AdaViP: Aligning Multi-modal LLMs via Adaptive Vision-enhanced Preference Optimization
Jinda Lu
Jinghan Li
Yuan Gao
Junkang Wu
Jiancan Wu
X. Wang
Xiangnan He
112
0
0
22 Apr 2025
Emergence and Evolution of Interpretable Concepts in Diffusion Models
Berk Tinaz
Zalan Fabian
Mahdi Soltanolkotabi
DiffM
23
0
0
21 Apr 2025
LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
Yuan-Hong Liao
Sven Elflein
Liu He
Laura Leal-Taixe
Yejin Choi
Sanja Fidler
David Acuna
ReLM
LRM
VLM
123
0
0
21 Apr 2025
Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
Chun-Hsiao Yeh
Chenyu Wang
Shengbang Tong
Ta-Ying Cheng
Rouyu Wang
Tianzhe Chu
Yuexiang Zhai
Yubei Chen
Shenghua Gao
Yi Ma
LRM
63
0
0
21 Apr 2025
ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations
Ahmad Khalil
Mahmoud Khalil
A. Ngom
MLLM
VLM
47
0
0
20 Apr 2025
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar
Ali Etemad
29
0
0
16 Apr 2025
MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework
Zihan Ling
Zhiyao Guo
Yixuan Huang
Yi An
Shuai Xiao
Jinsong Lan
Xiaoyong Zhu
Bo Zheng
RALM
VLM
55
0
0
14 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei
Jiacong Wang
Haochen Wang
X. Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
26
2
0
14 Apr 2025
AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark
Aruna Gauba
Irene Pi
Yunze Man
Ziqi Pang
Vikram S. Adve
Yu-Xiong Wang
125
0
0
14 Apr 2025
VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro
Zheyuan Zhang
Monica Dou
Linkai Peng
Hongyi Pan
Ulas Bagci
Boqing Gong
VLM
58
0
0
12 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
151
0
0
11 Apr 2025
Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
Dibyadip Chatterjee
Edoardo Remelli
Yale Song
Bugra Tekin
Abhay Mittal
...
Shreyas Hampali
Eric Sauser
Shugao Ma
Angela Yao
Fadime Sener
VLM
46
0
0
10 Apr 2025
How Can Objects Help Video-Language Understanding?
Zitian Tang
Shijie Wang
Junho Cho
Jaewook Yoo
Chen Sun
42
0
0
10 Apr 2025
Perception in Reflection
Yana Wei
Liang Zhao
Kangheng Lin
En Yu
Yuang Peng
...
Jianjian Sun
Haoran Wei
Zheng Ge
Xiangyu Zhang
Vishal M. Patel
31
0
0
09 Apr 2025
A Framework for Situating Innovations, Opportunities, and Challenges in Advancing Vertical Systems with Large AI Models
Gaurav Verma
Jiawei Zhou
Mohit Chandra
Srijan Kumar
M. D. Choudhury
53
0
0
03 Apr 2025
Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning
Yiting Lu
X. Li
H. Wu
Bingchen Li
Weisi Lin
Zhibo Chen
37
1
0
02 Apr 2025
Scaling Language-Free Visual Representation Learning
David Fan
Shengbang Tong
Jiachen Zhu
Koustuv Sinha
Zhuang Liu
...
Michael G. Rabbat
Nicolas Ballas
Yann LeCun
Amir Bar
Saining Xie
CLIP
VLM
64
2
0
01 Apr 2025
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Jewon Lee
Ki-Ung Song
Seungmin Yang
Donguk Lim
Jaeyeon Kim
Wooksu Shin
Bo-Kyeong Kim
Yong Jae Lee
Tae-Ho Kim
VLM
55
0
0
01 Apr 2025
Self-Evolving Visual Concept Library using Vision-Language Critics
Atharva Sehgal
Patrick Yuan
Ziniu Hu
Yisong Yue
Jennifer J. Sun
Swarat Chaudhuri
VLM
45
0
0
31 Mar 2025
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Francesco P. Ramunno
Paolo Massa
Vitaliy Kinakh
Brandon Panos
A. Csillaghy
S. Voloshynovskiy
DiffM
53
0
0
31 Mar 2025
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Maximilian Augustin
Yannic Neuhaus
Matthias Hein
VLM
52
1
0
30 Mar 2025
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
Antonia Karamolegkou
Malvina Nikandrou
Georgios Pantazopoulos
Danae Sanchez Villegas
Phillip Rust
Ruchira Dhar
Daniel Hershcovich
Anders Søgaard
36
0
0
28 Mar 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
J. Huang
Baoxiong Jia
Y. Wang
Ziyu Zhu
Xiongkun Linghu
Qing Li
Song-Chun Zhu
Siyuan Huang
79
3
0
28 Mar 2025
Scaling Vision Pre-Training to 4K Resolution
Baifeng Shi
Boyi Li
Han Cai
Y. Lu
Sifei Liu
...
Jan Kautz
Song Han
Trevor Darrell
Pavlo Molchanov
Hongxu Yin
CLIP
133
0
0
25 Mar 2025
LangBridge: Interpreting Image as a Combination of Language Embeddings
Jiaqi Liao
Yuwei Niu
Fanqing Meng
Hao Li
Changyao Tian
...
Dianqi Li
X. Zhu
Li Yuan
Jifeng Dai
Yu Cheng
MLLM
72
0
0
25 Mar 2025
On the Perception Bottleneck of VLMs for Chart Understanding
Junteng Liu
Weihao Zeng
Xiwen Zhang
Yijun Wang
Zifei Shan
Junxian He
60
0
0
24 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Y. Yang
Afshin Dehghan
54
1
0
24 Mar 2025
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
Wenhao You
Bryan Hooi
Yiwei Wang
Y. Wang
Zong Ke
Ming Yang
Zi Huang
Yujun Cai
AAML
56
0
0
24 Mar 2025
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye
Haozhe Qi
Alexander Mathis
Mackenzie W. Mathis
68
1
0
24 Mar 2025
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Qiao Liang
Yanjiang Liu
Ben He
Y. Lu
Hongyu Lin
Jia Zheng
Xianpei Han
Le Sun
Yingfei Sun
39
0
0
23 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
48
0
0
22 Mar 2025
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Davide Berasi
Matteo Farina
Massimiliano Mancini
Elisa Ricci
Nicola Strisciuglio
CoGe
68
0
0
21 Mar 2025
A Vision Centric Remote Sensing Benchmark
Abduljaleel Adejumo
Faegheh Yeganli
Clifford Broni-Bediako
Aoran Xiao
Naoto Yokoya
Mennatullah Siam
64
0
0
20 Mar 2025
TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data
Rohit Kundu
Athula Balachandran
A. Roy-Chowdhury
45
0
0
20 Mar 2025
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li
Jiajun Sun
Guodong Zheng
Xiaoran Fan
Yujiong Shen
...
Wenming Tan
Tao Ji
Tao Gui
Qi Zhang
Xuanjing Huang
AAML
VLM
87
0
0
19 Mar 2025
TULIP: Towards Unified Language-Image Pretraining
Zineng Tang
Long Lian
Seun Eisape
Xudong Wang
Roei Herzig
Adam Yala
Alane Suhr
Trevor Darrell
David M. Chan
VLM
CLIP
MLLM
103
3
0
19 Mar 2025
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
Boshen Xu
Yuting Mei
Xinbi Liu
Sipeng Zheng
Qin Jin
VLM
MDE
65
0
0
19 Mar 2025
Where do Large Vision-Language Models Look at when Answering Questions?
X. Xing
Chia-Wen Kuo
Li Fuxin
Yulei Niu
Fan Chen
Ming Li
Ying Wu
Longyin Wen
Sijie Zhu
LRM
58
0
0
18 Mar 2025
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Ziwei Wang
Weizhi Chen
Leyang Yang
Sheng Zhou
Shengchu Zhao
Hanbei Zhan
Jiongchao Jin
Liangcheng Li
Zirui Shao
Jiajun Bu
65
1
0
18 Mar 2025
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger
Nina Wenzel
David Griffiths
Haiming Gang
Justin Lazarow
...
Kai Kang
Marcin Eichner
Y. Yang
Afshin Dehghan
Peter Grasch
72
2
0
17 Mar 2025
Dynamic Relation Inference via Verb Embeddings
Omri Suissa
Muhiim Ali
Ariana Azarbal
Hui Shen
Shekhar Pradhan
43
0
0
17 Mar 2025
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
Wanhua Li
Renping Zhou
Jiawei Zhou
Yingwei Song
Johannes Herter
Minghan Qin
Gao Huang
Hanspeter Pfister
3DGS
VLM
66
0
0
13 Mar 2025
1
2
3
4
5
Next