Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal, Tejas Khot, D. Summers-Stay, Dhruv Batra, Devi Parikh · CoGe · 2 December 2016
arXiv: 1612.00837 (v3, latest)
Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering" (50 of 2,037 papers shown)
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models
Linglin Jing, Yuting Gao, Zhigang Wang, Wang Lan, Yiwen Tang, Wenhai Wang, Kaipeng Zhang, Qingpei Guo · MoE · 28 May 2025
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu · VLM · 28 May 2025
ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval
Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, Nathan Jacobs · 27 May 2025
Multimodal Federated Learning: A Survey through the Lens of Different FL Paradigms
Yuanzhe Peng, Jieming Bian, Lei Wang, Yin Huang, Jie Xu · 27 May 2025
Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
Yufei Zhan, Hongyin Zhao, Yousong Zhu, Shurong Zheng, Fan Yang, Ming Tang, Jinqiao Wang · VLM, LRM · 27 May 2025
ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Bozhou Li, Wentao Zhang · VLM · 27 May 2025
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Z. Kira · AAML · 27 May 2025
RefAV: Towards Planning-Centric Scenario Mining
Cainan Davidson, Deva Ramanan, Neehar Peri · 27 May 2025
MM-Prompt: Cross-Modal Prompt Tuning for Continual Visual Question Answering
Xu Li, Fan Lyu · LRM · 26 May 2025
Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Hyunsik Chae, Seungwoo Yoon, J. Park, Chloe Yewon Chun, Yongin Cho, Mu Cai, Yong Jae Lee, Ernest K. Ryu · CoGe, VLM · 26 May 2025
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Qinghong Lin, W. Zuo, Lijuan Wang · ReLM, LRM · 26 May 2025
My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals
Jian Lan, Yifei Fu, Udo Schlegel, Gengyuan Zhang, Tanveer Hannan, Haokun Chen, Thomas Seidl · 26 May 2025
Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models
Xinmiao Hu, C. Wang, Ruihe An, ChenYu Shao, Xiaojun Ye, Sheng Zhou, Liangcheng Li · MLLM, LRM · 26 May 2025
GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance
Mohammad Mahdi Moradi, Sudhir Mudur · 25 May 2025
Caption This, Reason That: VLMs Caught in the Middle
Zihan Weng, Lucas Gomez, Taylor Whittington Webb, P. Bashivan · VLM, LRM · 24 May 2025
ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models
Duo Li, Zuhao Yang, Shijian Lu · VLM · 24 May 2025
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, DaeJin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, Sungwoong Kim · MLLM, VLM · 23 May 2025
DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval
Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, ..., Bing Li, Lin Song, Jun Gao, Peng Li, Weiming Hu · 23 May 2025
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Jacob A. Hansen, Wei Lin, Junmo Kang, M. Jehanzeb Mirza, Hongyin Luo, Rogerio Feris, Alan Ritter, James R. Glass, Leonid Karlinsky · VLM · 23 May 2025
TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
Zeqing Wang, Shiyuan Zhang, Chengpei Tang, Keze Wang · LRM · 21 May 2025
Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
Junlin Li, Guodong DU, Jing Li, Sim Kuan Goh, Wenya Wang, ..., Fangming Liu, Jing Li, Saleh Alharbi, Daojing He, Min Zhang · MoMe, CLL · 21 May 2025
TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
Yanshu Li, Tian Yun, Jianjiang Yang, Pinyuan Feng, Jinfa Huang, Ruixiang Tang · 21 May 2025
CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention
Yanshu Li, JianJiang Yang, Bozheng Li, Ruixiang Tang · 21 May 2025
How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads
Ingeol Baek, Hwan Chang, Sunghyun Ryu, Hwanhee Lee · 21 May 2025
Visual Question Answering on Multiple Remote Sensing Image Modalities
Hichem Boussaid, Lucrezia Tosato, F. Weissgerber, Camille Kurtz, Laurent Wendling, Sylvain Lobry · 21 May 2025
VoQA: Visual-only Question Answering
Luyang Jiang, Jianing An, Jie Luo, Wenjun Wu, Lei Huang · LRM · 20 May 2025
ModRWKV: Transformer Multimodality in Linear Time
Jiale Kang, Ziyin Yue, Qingyu Yin, Jiang Rui, W. Li, Zening Lu, Zhouran Ji · OffRL · 20 May 2025
AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning
Kai Zhang, Xingyu Chen, Xiaofeng Zhang · 19 May 2025
STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference
Yichen Guo, Hanze Li, Zonghao Zhang, Jinhao You, Kai Tang, Xiande Huang · VLM · 18 May 2025
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Bonan Li, Zicheng Zhang, Songhua Liu, Weihao Yu, Xinchao Wang · VLM · 17 May 2025
Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, Xuming Hu · LRM · 17 May 2025
Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić · LM&Ro, LRM · 16 May 2025
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
Pengju Xu, Yan Wang, Shuyuan Zhang, Xuan Zhou, Xin Li, ..., Fengzhao Li, Shuigeng Zhou, Xingyu Wang, Yi Zhang, Haiying Zhao · VLM · 16 May 2025
Task-Core Memory Management and Consolidation for Long-term Continual Learning
Tianyu Huai, Jie Zhou, Yuxuan Cai, Qin Chen, Wen Wu, Xingjiao Wu, Xipeng Qiu, Liang He · CLL · 15 May 2025
Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning
Dayong Liang, Changmeng Zheng, Zhiyuan Wen, Yi Cai, Xiao Wei, Qing Li · LRM · 14 May 2025
Variational Visual Question Answering
Tobias Jan Wieczorek, Nathalie Daun, Mohammad Emtiyaz Khan, Marcus Rohrbach · OOD · 14 May 2025
Bias and Generalizability of Foundation Models across Datasets in Breast Mammography
Elodie Germani, Ilayda Selin Türk, Fatima Zeineddine, Charbel Mourad, Shadi Albarqouni · AI4CE · 14 May 2025
Visually Interpretable Subtask Reasoning for Visual Question Answering
Yu Cheng, A. Goel, Hakan Bilen · LRM · 12 May 2025
DriveSOTIF: Advancing Perception SOTIF Through Multimodal Large Language Models
Shucheng Huang, Freda Shi, Chen Sun, Jiaming Zhong, Minghao Ning, Yufeng Yang, Yukun Lu, Hong Wang, A. Khajepour · 11 May 2025
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models
Aishwarya Venkataramanan, P. Bodesheim, Joachim Denzler · BDL, VLM · 08 May 2025
SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning
Jinpeng Chen, Runmin Cong, Yuzhi Zhao, Hongzheng Yang, Guangneng Hu, H. Ip, Sam Kwong · CLL, KELM · 05 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei, Jintao Guo, Shanshan Zhao, Minghao Fu, Lunhao Duan, ..., Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang · DiffM · 05 May 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Wufei Ma, Luoxin Ye, Nessa McWeeney, Celso M de Melo, Jieneng Chen · LRM · 01 May 2025
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen, Candace Ross, Reyhane Askari Hemmat, Koustuv Sinha, Melissa Hall, M. Drozdzal, Adriana Romero-Soriano · EGVM · 01 May 2025
Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding
Trilok Padhi, R. Kaur, Adam D. Cobb, Manoj Acharya, Anirban Roy, Colin Samplawski, Brian Matejek, Alexander M. Berenbeim, Nathaniel D. Bastian, Susmit Jha · 30 Apr 2025
Rethinking Visual Layer Selection in Multimodal LLMs
H. Chen, Junyan Lin, Xinhao Chen, Yue Fan, Xin Jin, Hui Su, Jianfeng Dong, Jinlan Fu, Xiaoyu Shen · VLM · 30 Apr 2025
VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning
Run Luo, Renke Shan, Longze Chen, Ziqiang Liu, Lu Wang, Min Yang, Xiaobo Xia · MLLM, VLM · 28 Apr 2025
Platonic Grounding for Efficient Multimodal Language Models
Moulik Choraria, Xinbo Wu, Akhil Bhimaraju, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav Varshney · 27 Apr 2025
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Yusen Zhang, Wenliang Zheng, Aashrith Madasu, Peng Shi, Ryo Kamoi, ..., Ranran Haoran Zhang, Avitej Iyer, Renze Lou, Wenpeng Yin, Rui Zhang · 25 Apr 2025
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Zehao Wang, Senthil Purushwalkam, Caiming Xiong, Siyang Song, Chenhui Xu, Ran Xu · 23 Apr 2025