1906.00067
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
31 May 2019
Kenneth Marino
Mohammad Rastegari
Ali Farhadi
Roozbeh Mottaghi
Papers citing "OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge"
50 / 781 papers shown
Title
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning
Niki Maria Foteinopoulou
Enjie Ghorbel
Djamila Aouada
133
4
0
01 Oct 2024
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models
Shitian Zhao
Renrui Zhang
Xu Luo
Yan Wang
Shanghang Zhang
Peng Gao
91
0
0
01 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM
MLLM
133
41
1
30 Sep 2024
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
Zicheng Zhang
Ziheng Jia
H. Wu
Chunyi Li
Zijian Chen
...
Wei Sun
Xiaohong Liu
Xiongkuo Min
Weisi Lin
Guangtao Zhai
105
10
0
30 Sep 2024
T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition
Chen Yeh
You-Ming Chang
Wei-Chen Chiu
Ning Yu
66
2
0
29 Sep 2024
Visual Question Decomposition on Multimodal Large Language Models
Haowei Zhang
Jianzhe Liu
Zhen Han
Shuo Chen
Bailan He
Volker Tresp
Zhiqiang Xu
Jindong Gu
157
2
0
28 Sep 2024
TrojVLM: Backdoor Attack Against Vision Language Models
Weimin Lyu
Lu Pang
Tengfei Ma
Haibin Ling
Chao Chen
MLLM
97
11
0
28 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
164
12
0
26 Sep 2024
EAGLE: Egocentric AGgregated Language-video Engine
Jing Bi
Yunlong Tang
Luchuan Song
Ali Vosoughi
Nguyen Nguyen
Chenliang Xu
97
11
0
26 Sep 2024
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Matt Deitke
Christopher Clark
Sangho Lee
Rohun Tripathi
Yue Yang
...
Noah A. Smith
Hannaneh Hajishirzi
Ross Girshick
Ali Farhadi
Aniruddha Kembhavi
OSLM
VLM
110
13
0
25 Sep 2024
Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond
Hong Chen
Xin Wang
Yuwei Zhou
Bin Huang
Yipeng Zhang
Wei Feng
Houlun Chen
Zeyang Zhang
Siao Tang
Wenwu Zhu
DiffM
124
9
0
23 Sep 2024
VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models
Nam Hyeon-Woo
Moon Ye-Bin
Wonseok Choi
Lee Hyun
Tae-Hyun Oh
CoGe
68
3
0
23 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe
VLM
171
2
0
19 Sep 2024
NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai
Nayeon Lee
Wei Ping
Zhuoling Yang
Zihan Liu
Jon Barker
Tuomas Rintamaki
Mohammad Shoeybi
Bryan Catanzaro
Ming-Yu Liu
MLLM
VLM
LRM
119
73
0
17 Sep 2024
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha
Vinija Jain
Aman Chadha
70
3
0
14 Sep 2024
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks
Md Zarif Hossain
Ahmed Imteaj
AAML
VLM
81
6
0
11 Sep 2024
UNIT: Unifying Image and Text Recognition in One Vision Encoder
Yi Zhu
Yanpeng Zhou
Chunwei Wang
Yang Cao
Jianhua Han
Lu Hou
Hang Xu
ViT
VLM
114
4
0
06 Sep 2024
An overview of domain-specific foundation model: key technologies, applications and challenges
Haolong Chen
Hanzhi Chen
Zijian Zhao
Kaifeng Han
Guangxu Zhu
Yichen Zhao
Ying Du
Wei Xu
Qingjiang Shi
ALM
VLM
111
5
0
06 Sep 2024
Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models
Bin Fu
Qiyang Wan
Jialin Li
Ruiping Wang
Xilin Chen
55
0
0
03 Sep 2024
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Yonghui Wang
Wengang Zhou
Hao Feng
Houqiang Li
VLM
68
1
0
30 Aug 2024
CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong
Weihan Wang
Ming Ding
Wenmeng Yu
Qingsong Lv
...
Debing Liu
Bin Xu
Juanzi Li
Yuxiao Dong
Jie Tang
VLM
MLLM
111
121
0
29 Aug 2024
Law of Vision Representation in MLLMs
Shijia Yang
Bohan Zhai
Quanzeng You
Jianbo Yuan
Hongxia Yang
Chenfeng Xu
145
12
0
29 Aug 2024
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
M. Maruf
Arka Daw
Kazi Sajeed Mehrab
Harish Babu Manogaran
Abhilash Neog
...
Wei-Lun Chao
Charles V. Stewart
T. Berger-Wolf
Wasila Dahdul
Anuj Karpatne
CoGe
89
4
0
28 Aug 2024
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Fangxun Shu
Yue Liao
Le Zhuo
Chenning Xu
Guanghao Zhang
...
Bolin Li
Zhelun Yu
Si Liu
Hongsheng Li
Hao Jiang
VLM
MoE
63
18
0
28 Aug 2024
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA
ELM
LRM
110
26
0
28 Aug 2024
Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis
Aishik Nagar
Shantanu Jaiswal
Cheston Tan
ReLM
LRM
60
12
0
27 Aug 2024
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
Yiwei Ma
Jiayi Ji
Ke Ye
Weihuang Lin
Zhibin Wang
Yonghan Zheng
Qiang-feng Zhou
Xiaoshuai Sun
Rongrong Ji
123
11
0
26 Aug 2024
Knowledge-Aware Reasoning over Multimodal Semi-structured Tables
Suyash Vardhan Mathur
J. Bafna
Kunal Kartik
Harshita Khandelwal
Manish Shrivastava
Vivek Gupta
Joey Tianyi Zhou
Dan Roth
LMTD
113
2
0
25 Aug 2024
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
135
78
0
22 Aug 2024
CluMo: Cluster-based Modality Fusion Prompt for Continual Learning in Visual Question Answering
Yuliang Cai
Mohammad Rostami
CLL
VLM
MLLM
123
4
0
21 Aug 2024
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
Qizhou Chen
Taolin Zhang
Chengyu Wang
Xiaofeng He
Dakan Wang
Tingting Liu
KELM
159
4
0
19 Aug 2024
Quality Assessment in the Era of Large Models: A Survey
Zicheng Zhang
Yingjie Zhou
Chunyi Li
Baixuan Zhao
Xiaohong Liu
Guangtao Zhai
103
12
0
17 Aug 2024
IIU: Independent Inference Units for Knowledge-based Visual Question Answering
Yili Li
Jing Yu
Keke Gai
Gang Xiong
51
0
0
15 Aug 2024
Can Large Language Models Understand Symbolic Graphics Programs?
Zeju Qiu
Weiyang Liu
Haiwen Feng
Zhen Liu
Tim Z. Xiao
Katherine M. Collins
J. Tenenbaum
Adrian Weller
Michael J. Black
Bernhard Schölkopf
123
14
0
15 Aug 2024
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye
Haiyang Xu
Haowei Liu
Anwen Hu
Ming Yan
Qi Qian
Ji Zhang
Fei Huang
Jingren Zhou
MLLM
VLM
86
139
0
09 Aug 2024
How Well Can Vision Language Models See Image Details?
Chenhui Gou
Abdulwahab Felemban
Faizan Farooq Khan
Deyao Zhu
Jianfei Cai
Hamid Rezatofighi
Mohamed Elhoseiny
VLM
MLLM
100
5
0
07 Aug 2024
Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
Weiqi Feng
Yangrui Chen
Shaoyu Wang
Size Zheng
H. Lin
Minlan Yu
MLLM
AI4CE
138
4
0
07 Aug 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM
SyDa
VLM
166
865
0
06 Aug 2024
Fairness and Bias Mitigation in Computer Vision: A Survey
Sepehr Dehdashtian
Ruozhen He
Yi Li
Guha Balakrishnan
Nuno Vasconcelos
Vicente Ordonez
Vishnu Boddeti
137
5
0
05 Aug 2024
MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training
Rivik Setty
Chengjin Xu
Vinay Setty
Jian Guo
83
13
0
31 Jul 2024
Autonomous Improvement of Instruction Following Skills via Foundation Models
Zhiyuan Zhou
P. Atreya
Abraham Lee
Homer Walke
Oier Mees
Sergey Levine
95
14
0
30 Jul 2024
FlexAttention for Efficient High-Resolution Vision-Language Models
Junyan Li
Delin Chen
Tianle Cai
Peihao Chen
Yining Hong
Zhenfang Chen
Yikang Shen
Chuang Gan
VLM
125
5
0
29 Jul 2024
Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
Nitzan Bitton-Guetta
Aviv Slobodkin
Aviya Maimon
Eliya Habba
Royi Rassin
Yonatan Bitton
Idan Szpektor
Amir Globerson
Yuval Elovici
ReLM
VLM
LRM
71
6
0
28 Jul 2024
Data Processing Techniques for Modern Multimodal Models
Yinheng Li
Han Ding
Hang Chen
VLM
87
0
0
27 Jul 2024
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models
Xinyu Pi
Mingyuan Wu
Jize Jiang
Haozhen Zheng
Beitong Tian
Chengxiang Zhai
Klara Nahrstedt
Zhiting Hu
VLM
108
1
0
25 Jul 2024
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
Yiwei Ma
Zhibin Wang
Xiaoshuai Sun
Weihuang Lin
Qiang-feng Zhou
Jiayi Ji
Rongrong Ji
MLLM
VLM
105
2
0
23 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
68
2
0
22 Jul 2024
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Zhecan Wang
Garrett Bingham
Adams Wei Yu
Quoc V. Le
Thang Luong
Golnaz Ghiasi
MLLM
LRM
126
13
0
22 Jul 2024
Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
Wenbin An
Feng Tian
Jiahao Nie
Wenkai Shi
Haonan Lin
Yan Chen
Qianying Wang
Y. Wu
Guang Dai
Ping Chen
VLM
94
4
0
22 Jul 2024
MIBench: Evaluating Multimodal Large Language Models over Multiple Images
Haowei Liu
Xi Zhang
Haiyang Xu
Yaya Shi
Chaoya Jiang
...
Ji Zhang
Fei Huang
Chunfen Yuan
Bing Li
Weiming Hu
VLM
94
15
0
21 Jul 2024