1612.00837
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 1,968 papers shown
Asking More Informative Questions for Grounded Retrieval
Sedrick Scott Keh
Justin T. Chiu
Daniel Fried
19
3
0
14 Nov 2023
Towards Open-Ended Visual Recognition with Large Language Model
Qihang Yu
Xiaohui Shen
Liang-Chieh Chen
VLM
22
8
0
14 Nov 2023
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Junke Wang
Lingchen Meng
Zejia Weng
Bo He
Zuxuan Wu
Yu-Gang Jiang
MLLM
VLM
38
94
0
13 Nov 2023
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation
Junyang Wang
Yuhang Wang
Guohai Xu
Jing Zhang
Yukai Gu
...
Jiaqi Wang
Haiyang Xu
Ming Yan
Ji Zhang
Jitao Sang
MLLM
VLM
22
104
0
13 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
İlker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
41
15
0
13 Nov 2023
InfMLLM: A Unified Framework for Visual-Language Tasks
Qiang-feng Zhou
Zhibin Wang
Wei Chu
Yinghui Xu
Hao Li
Yuan Qi
MLLM
29
12
0
12 Nov 2023
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi
Lewei Yao
Jiahui Gao
Jipeng Zhang
Tong Zhang
MLLM
28
31
0
11 Nov 2023
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li
Biao Yang
Qiang Liu
Zhiyin Ma
Shuo Zhang
Jingxu Yang
Yabo Sun
Yuliang Liu
Xiang Bai
MLLM
50
247
0
11 Nov 2023
Analyzing Modular Approaches for Visual Question Decomposition
Apoorv Khandelwal
Ellie Pavlick
Chen Sun
50
4
0
10 Nov 2023
Towards A Unified Neural Architecture for Visual Recognition and Reasoning
Calvin Luo
Boqing Gong
Ting Chen
Chen Sun
OCL
ObjD
34
1
0
10 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
50
144
0
10 Nov 2023
Zero-shot Translation of Attention Patterns in VQA Models to Natural Language
Leonard Salewski
A. Sophia Koepke
Hendrik P. A. Lensch
Zeynep Akata
47
2
0
08 Nov 2023
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs
Zhenfang Chen
Rui Sun
Wenjun Liu
Yining Hong
Chuang Gan
LRM
33
14
0
08 Nov 2023
Multitask Multimodal Prompted Training for Interactive Embodied Task Completion
Georgios Pantazopoulos
Malvina Nikandrou
Amit Parekh
Bhathiya Hemanthage
Arash Eshghi
Ioannis Konstas
Verena Rieser
Oliver Lemon
Alessandro Suglia
LM&Ro
39
7
0
07 Nov 2023
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye
Haiyang Xu
Jiabo Ye
Mingshi Yan
Anwen Hu
Haowei Liu
Qi Qian
Ji Zhang
Fei Huang
Jingren Zhou
MLLM
VLM
129
389
0
07 Nov 2023
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
Junyan Li
Delin Chen
Yining Hong
Zhenfang Chen
Peihao Chen
Yikang Shen
Chuang Gan
MLLM
38
15
0
06 Nov 2023
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos
Te-Lin Wu
Zi-Yi Dou
Qingyuan Hu
Yu Hou
Nischal Reddy Chandra
Marjorie Freedman
R. Weischedel
Nanyun Peng
44
5
0
02 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
VLM
DiffM
42
10
0
01 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
59
36
0
01 Nov 2023
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?
Yichi Zhang
Jiayi Pan
Yuchen Zhou
Rui Pan
Joyce Chai
VLM
29
13
0
31 Oct 2023
Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts
Deepanway Ghosal
Navonil Majumder
Roy Ka-wei Lee
Rada Mihalcea
Soujanya Poria
38
7
0
31 Oct 2023
What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Amita Kamath
Jack Hessel
Kai-Wei Chang
LRM
CoGe
19
98
0
30 Oct 2023
Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Changsheng Lv
Shuai Zhang
Yapeng Tian
Mengshi Qi
Huadong Ma
CML
46
16
0
30 Oct 2023
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese
Khiem Vinh Tran
Hao Phu Phan
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
34
5
0
27 Oct 2023
3D-Aware Visual Question Answering about Parts, Poses and Occlusions
Xingrui Wang
Wufei Ma
Zhuowan Li
Adam Kortylewski
Alan Yuille
CoGe
29
12
0
27 Oct 2023
Impressions: Understanding Visual Semiotics and Aesthetic Impact
Julia Kruk
Caleb Ziems
Diyi Yang
32
2
0
27 Oct 2023
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Laura Cabello
Emanuele Bugliarello
Stephanie Brandl
Desmond Elliott
28
7
0
26 Oct 2023
AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors
You-Ming Chang
Chen Yeh
Wei-Chen Chiu
Ning Yu
VPVLM
VLM
83
23
0
26 Oct 2023
Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan
B. Vijaykumar
S. Schulter
Manmohan Chandraker
Yun Fu
ReLM
25
10
0
25 Oct 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
35
9
0
25 Oct 2023
An Early Evaluation of GPT-4V(ision)
Yang Wu
Shilong Wang
Hao Yang
Tian Zheng
Hongbo Zhang
Yanyan Zhao
Bing Qin
MLLM
ELM
9
35
0
25 Oct 2023
Knowledge Editing for Large Language Models: A Survey
Song Wang
Yaochen Zhu
Haochen Liu
Zaiyi Zheng
Chen Chen
Wenlin Yao
KELM
81
138
0
24 Oct 2023
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
44
2
0
24 Oct 2023
Emergent Communication in Interactive Sketch Question Answering
Zixing Lei
Yiming Zhang
Yuxin Xiong
Siheng Chen
40
2
0
24 Oct 2023
Multimodal Representations for Teacher-Guided Compositional Visual Reasoning
Wafa Aissa
Marin Ferecatu
M. Crucianu
LRM
28
0
0
24 Oct 2023
LXMERT Model Compression for Visual Question Answering
Maryam Hashemi
Ghazaleh Mahmoudi
Sara Kodeiri
Hadi Sheikhi
Sauleh Eetemadi
VLM
29
4
0
23 Oct 2023
Large Language Models are Visual Reasoning Coordinators
Liangyu Chen
Bo Li
Sheng Shen
Jingkang Yang
Chunyuan Li
Kurt Keutzer
Trevor Darrell
Ziwei Liu
VLM
LRM
43
51
0
23 Oct 2023
Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond
Zhecan Wang
Long Chen
Haoxuan You
Keyang Xu
Yicheng He
Wenhao Li
Noel Codella
Kai-Wei Chang
Shih-Fu Chang
35
3
0
23 Oct 2023
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Tianrui Guan
Fuxiao Liu
Xiyang Wu
Ruiqi Xian
Zongxia Li
...
Lichang Chen
Furong Huang
Yaser Yacoob
Dinesh Manocha
VLM
MLLM
42
157
0
23 Oct 2023
ITEm: Unsupervised Image-Text Embedding Learning for eCommerce
Baohao Liao
Michael Kozielski
Sanjika Hewavitharana
Jiangbo Yuan
Shahram Khadivi
Tomer Lancewicki
SSL
23
0
0
22 Oct 2023
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Ziqi Pang
Ziyang Xie
Yunze Man
Yu-xiong Wang
58
25
0
19 Oct 2023
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
Hanbo Zhang
Jie Xu
Yuchen Mo
Tao Kong
22
1
0
18 Oct 2023
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Yanyang Guo
Fangkai Jiao
Zhiqi Shen
Liqiang Nie
Mohan S. Kankanhalli
MLLM
35
5
0
17 Oct 2023
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
Rujie Wu
Xiaojian Ma
Zhenliang Zhang
Wei Wang
Qing Li
Song-Chun Zhu
Yizhou Wang
LRM
VLM
41
7
0
16 Oct 2023
VLIS: Unimodal Language Models Guide Multimodal Language Generation
Jiwan Chung
Youngjae Yu
VLM
37
1
0
15 Oct 2023
Beyond Segmentation: Road Network Generation with Multi-Modal LLMs
Sumedh Rasal
Sanjay K. Boddhu
40
5
0
15 Oct 2023
Overview of ImageArg-2023: The First Shared Task in Multimodal Argument Mining
Zhexiong Liu
Mohamed Elaraby
Yang Zhong
Diane Litman
19
11
0
15 Oct 2023
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen
Deyao Zhu
Xiaoqian Shen
Xiang Li
Zechun Liu
Pengchuan Zhang
Raghuraman Krishnamoorthi
Vikas Chandra
Yunyang Xiong
Mohamed Elhoseiny
MLLM
168
448
0
14 Oct 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen
Xiao Wang
Lucas Beyer
Alexander Kolesnikov
Jialin Wu
...
Keran Rong
Tianli Yu
Daniel Keysers
Xiao-Qi Zhai
Radu Soricut
MLLM
VLM
41
94
0
13 Oct 2023
EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs
Xiangyu Zhao
Bo Liu
Qijiong Liu
Guangyuan Shi
Xiao-Ming Wu
VLM
DiffM
29
7
0
13 Oct 2023