Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1906.00067
Cited By
v1
v2 (latest)
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
31 May 2019
Kenneth Marino
Mohammad Rastegari
Ali Farhadi
Roozbeh Mottaghi
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge"
50 / 781 papers shown
Title
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
Henry Hengyuan Zhao
Pan Zhou
Mike Zheng Shou
MLLM
SyDa
107
7
0
11 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
81
6
0
11 Dec 2023
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models
Shitian Zhao
Zhuowan Li
Yadong Lu
Alan Yuille
Yan Wang
LRM
73
9
0
09 Dec 2023
GlitchBench: Can large multimodal models detect video game glitches?
Mohammad Reza Taesiri
Tianjun Feng
Anh Totti Nguyen
Cor-Paul Bezemer
MLLM
VLM
LRM
121
11
0
08 Dec 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLM
VLM
96
17
0
08 Dec 2023
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
Yushi Hu
Otilia Stretcu
Chun-Ta Lu
Krishnamurthy Viswanathan
Kenji Hata
Enming Luo
Ranjay Krishna
Ariel Fuxman
VLM
LRM
MLLM
126
37
0
05 Dec 2023
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Rizhao Cai
Zirui Song
Dayan Guan
Zhenhao Chen
Xing Luo
Chenyu Yi
Alex C. Kot
MLLM
VLM
103
35
0
05 Dec 2023
Recursive Visual Programming
Jiaxin Ge
Sanjay Subramanian
Baifeng Shi
Roei Herzig
Trevor Darrell
46
7
0
04 Dec 2023
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li
Jiawei Peng
Huiyi Chen
Chongyang Gao
Xu Yang
MLLM
108
22
0
04 Dec 2023
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
Jialin Wu
Xia Hu
Yaqing Wang
Bo Pang
Radu Soricut
MoE
80
16
0
01 Dec 2023
Dolphins: Multimodal Language Model for Driving
Yingzi Ma
Yulong Cao
Jiachen Sun
Marco Pavone
Chaowei Xiao
MLLM
109
64
0
01 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
146
61
0
30 Nov 2023
MLLMs-Augmented Visual-Language Representation Learning
Yanqing Liu
Kai Wang
Wenqi Shao
Ping Luo
Yu Qiao
Mike Zheng Shou
Kaipeng Zhang
Yang You
VLM
93
12
0
30 Nov 2023
Understanding and Improving In-Context Learning on Vision-language Models
Shuo Chen
Zhen Han
Bailan He
Mark Buckley
Philip Torr
Volker Tresp
Jindong Gu
80
7
0
29 Nov 2023
Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
Xiujun Li
Yujie Lu
Zhe Gan
Jianfeng Gao
William Y. Wang
Yejin Choi
VLM
MLLM
79
3
0
29 Nov 2023
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
Xin Liu
Yichen Zhu
Jindong Gu
Yunshi Lan
Chao Yang
Yu Qiao
137
109
0
29 Nov 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
60
3
0
29 Nov 2023
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Zeyu Han
Fangrui Zhu
Qianru Lao
Huaizu Jiang
ObjD
147
12
0
28 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
166
507
0
28 Nov 2023
Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models
Zhihe Lu
Jiawang Bai
Xin Li
Zeyu Xiao
Xinchao Wang
VLM
76
12
0
28 Nov 2023
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra
Brandon Huang
Trevor Darrell
Roei Herzig
MLLM
LRM
111
98
0
27 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
344
960
0
27 Nov 2023
Continual Instruction Tuning for Large Multimodal Models
Jinghan He
Haiyun Guo
Ming Tang
Jinqiao Wang
VLM
MLLM
CLL
KELM
85
26
0
27 Nov 2023
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li
Baotian Hu
Wei Wang
Xiaochun Cao
Min Zhang
72
5
0
27 Nov 2023
Fully Authentic Visual Question Answering Dataset from Online Communities
Chongyan Chen
Mengchen Liu
Noel Codella
Yunsheng Li
Lu Yuan
Danna Gurari
108
5
0
27 Nov 2023
Large Language Models as Automated Aligners for benchmarking Vision-Language Models
Yuanfeng Ji
Chongjian Ge
Weikai Kong
Enze Xie
Zhengying Liu
Zhengguo Li
Ping Luo
MLLM
ELM
91
7
0
24 Nov 2023
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
Qifan Yu
Juncheng Li
Longhui Wei
Liang Pang
Wentao Ye
Bosheng Qin
Siliang Tang
Qi Tian
Yueting Zhuang
MLLM
VLM
116
82
0
22 Nov 2023
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Conghui He
Jiaqi Wang
Feng Zhao
Dahua Lin
MLLM
VLM
202
683
0
21 Nov 2023
De-fine: Decomposing and Refining Visual Programs with Auto-Feedback
Minghe Gao
Juncheng Li
Hao Fei
Liang Pang
Wei Ji
Guoming Wang
Wenqiao Zhang
Siliang Tang
Yueting Zhuang
73
9
0
21 Nov 2023
Causality is all you need
Ning Xu
Yifei Gao
Hongshuo Tian
Yongdong Zhang
An-An Liu
82
0
0
21 Nov 2023
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Gongwei Chen
Leyang Shen
Rui Shao
Xiang Deng
Liqiang Nie
VLM
MLLM
146
48
0
20 Nov 2023
Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions
Ziyue Wang
Chi Chen
Peng Li
Yang Liu
LRM
78
16
0
20 Nov 2023
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models
Xiaotian Han
Quanzeng You
Yongfei Liu
Wentao Chen
Huangjie Zheng
...
Yiqi Wang
Bohan Zhai
Jianbo Yuan
Heng Wang
Hongxia Yang
ReLM
LRM
ELM
157
10
0
20 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
129
72
0
16 Nov 2023
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Ziyi Lin
Chris Liu
Renrui Zhang
Peng Gao
Longtian Qiu
...
Siyuan Huang
Yichi Zhang
Xuming He
Hongsheng Li
Yu Qiao
MLLM
VLM
106
231
0
13 Nov 2023
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Junke Wang
Lingchen Meng
Zejia Weng
Bo He
Zuxuan Wu
Yu-Gang Jiang
MLLM
VLM
121
108
0
13 Nov 2023
A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering
Yunxin Li
Longyue Wang
Baotian Hu
Xinyu Chen
Wanqi Zhong
Chenyang Lyu
Wei Wang
Min Zhang
ELM
63
22
0
13 Nov 2023
Large Language Models for Robotics: A Survey
Fanlong Zeng
Wensheng Gan
Yongheng Wang
Ning Liu
Philip S. Yu
LM&Ro
190
140
0
13 Nov 2023
InfMLLM: A Unified Framework for Visual-Language Tasks
Qiang-feng Zhou
Zhibin Wang
Wei Chu
Yinghui Xu
Hao Li
Yuan Qi
MLLM
60
12
0
12 Nov 2023
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi
Lewei Yao
Jiahui Gao
Jipeng Zhang
Tong Zhang
MLLM
91
36
0
11 Nov 2023
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li
Biao Yang
Qiang Liu
Zhiyin Ma
Shuo Zhang
Jingxu Yang
Yabo Sun
Yuliang Liu
Xiang Bai
MLLM
133
277
0
11 Nov 2023
Analyzing Modular Approaches for Visual Question Decomposition
Apoorv Khandelwal
Ellie Pavlick
Chen Sun
82
4
0
10 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
111
174
0
10 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
Yue Liu
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
VLM
MLLM
100
66
0
07 Nov 2023
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye
Haiyang Xu
Jiabo Ye
Mingshi Yan
Anwen Hu
Haowei Liu
Qi Qian
Ji Zhang
Fei Huang
Jingren Zhou
MLLM
VLM
260
422
0
07 Nov 2023
CogVLM: Visual Expert for Pretrained Language Models
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
VLM
MLLM
170
517
0
06 Nov 2023
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Yifan Du
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Jinpeng Wang
Chuyuan Wang
Mingchen Cai
Ruihua Song
Ji-Rong Wen
VLM
MLLM
LRM
185
23
0
02 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
VLM
DiffM
103
11
0
01 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
151
44
0
01 Nov 2023
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese
Khiem Vinh Tran
Hao Phu Phan
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
49
7
0
27 Oct 2023
Previous
1
2
3
...
9
10
11
...
14
15
16
Next