1612.00837
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,037 papers shown
TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models
Indranil Sur
Karan Sikka
Matthew Walmer
K. Koneripalli
Anirban Roy
Xiaoyu Lin
Ajay Divakaran
Susmit Jha
64
9
0
07 Aug 2023
A Symbolic Character-Aware Model for Solving Geometry Problems
Maizhen Ning
Qiufeng Wang
Kaizhu Huang
Xiaowei Huang
77
18
0
05 Aug 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
187
721
0
04 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
138
88
0
03 Aug 2023
Grounded Image Text Matching with Mismatched Relation Reasoning
Yu Wu
Yan-Tao Wei
Haozhe Jasper Wang
Yongfei Liu
Sibei Yang
Xuming He
82
6
0
02 Aug 2023
Making the V in Text-VQA Matter
Shamanthak Hegde
Soumya Jahagirdar
Shankar Gangisetty
CoGe
87
4
0
01 Aug 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
58
2
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
128
46
0
30 Jul 2023
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li
Rui Wang
Guangzhi Wang
Yuying Ge
Yixiao Ge
Ying Shan
MLLM
ELM
138
572
0
30 Jul 2023
Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering
N. Naik
Christopher Potts
Elisa Kreiss
87
4
0
28 Jul 2023
Towards Generalist Biomedical AI
Tao Tu
Shekoofeh Azizi
Danny Driess
M. Schaekermann
Mohamed Amin
...
Yossi Matias
K. Singhal
Peter R. Florence
Alan Karthikesalingam
Vivek Natarajan
LM&MA
MedIm
AI4MH
139
277
0
26 Jul 2023
LOIS: Looking Out of Instance Semantics for Visual Question Answering
Siyu Zhang
Ye Chen
Yaoru Sun
Fang Wang
Haibo Shi
Haoran Wang
65
5
0
26 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming-Hsuan Yang
Fahad Shahbaz Khan
VLM
150
128
0
25 Jul 2023
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
Jingxuan Wei
Cheng Tan
Zhangyang Gao
Linzhuang Sun
Siyuan Li
Bihui Yu
R. Guo
Stan Z. Li
LRM
127
12
0
24 Jul 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
126
23
0
21 Jul 2023
Conformal prediction under ambiguous ground truth
David Stutz
Abhijit Guha Roy
Tatiana Matejovicova
Patricia Strachan
A. Cemgil
Arnaud Doucet
206
20
0
18 Jul 2023
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
Chaoya Jiang
Haiyang Xu
Wei Ye
Qinghao Ye
Chenliang Li
Mingshi Yan
Bin Bi
Shikun Zhang
Fei Huang
Songfang Huang
VLM
74
9
0
17 Jul 2023
PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen
Kiet Van Nguyen
54
2
0
17 Jul 2023
Planting a SEED of Vision in Large Language Model
Yuying Ge
Yixiao Ge
Ziyun Zeng
Xintao Wang
Ying Shan
VLM
MLLM
53
98
0
16 Jul 2023
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM
LRM
94
4
0
15 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
110
31
0
13 Jul 2023
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Gregor Geigle
Abhay Jain
Radu Timofte
Goran Glavaš
VLM
MLLM
123
32
0
13 Jul 2023
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Yue Liu
Songyang Zhang
...
Jiaqi Wang
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
213
1,060
0
12 Jul 2023
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
149
138
0
11 Jul 2023
Enhancing Cross-lingual Transfer via Phonemic Transcription Integration
Hoang Nguyen
Chenwei Zhang
Tao Zhang
Eugene Rohrbaugh
Philip S. Yu
78
7
0
10 Jul 2023
SVIT: Scaling up Visual Instruction Tuning
Bo Zhao
Boya Wu
Muyang He
Tiejun Huang
MLLM
110
128
0
09 Jul 2023
Read, Look or Listen? What's Needed for Solving a Multimodal Dataset
Netta Madvil
Yonatan Bitton
Roy Schwartz
74
3
0
06 Jul 2023
Several categories of Large Language Models (LLMs): A Short Survey
Saurabh Pahune
Manoj Chandrasekharan
AILaw
66
17
0
05 Jul 2023
Localized Questions in Medical Visual Question Answering
Sergio Tascon-Morales
Pablo Márquez-Neila
Raphael Sznitman
79
8
0
03 Jul 2023
Visual Instruction Tuning with Polite Flamingo
Delong Chen
Jianfeng Liu
Wenliang Dai
Baoyuan Wang
MLLM
126
48
0
03 Jul 2023
JourneyDB: A Benchmark for Generative Image Understanding
Keqiang Sun
Junting Pan
Yuying Ge
Hao Li
Haodong Duan
...
Yi Wang
Jifeng Dai
Yu Qiao
Limin Wang
Hongsheng Li
135
120
0
03 Jul 2023
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
Rui Sun
Zhecan Wang
Haoxuan You
Noel Codella
Kai-Wei Chang
Shih-Fu Chang
CLIP
137
4
0
03 Jul 2023
S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture
Ye Xue
Diego Klabjan
J. Utke
40
0
0
01 Jul 2023
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
A. S. Penamakuri
Manish Gupta
Mithun Das Gupta
Anand Mishra
78
7
0
29 Jun 2023
Deep Equilibrium Multimodal Fusion
Jinhong Ni
Yalong Bai
Wei Zhang
Ting Yao
Tao Mei
94
1
0
29 Jun 2023
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language
William Berrios
Gautam Mittal
Tristan Thrush
Douwe Kiela
Amanpreet Singh
MLLM
VLM
72
61
0
28 Jun 2023
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Qiong Wu
Shubin Huang
Yiyi Zhou
Pingyang Dai
Annan Shu
Guannan Jiang
Rongrong Ji
VLM
VPVLM
42
2
0
27 Jun 2023
FunQA: Towards Surprising Video Comprehension
Binzhu Xie
Sicheng Zhang
Zitang Zhou
Yue Liu
Yuanhan Zhang
Jack Hessel
Jingkang Yang
Ziwei Liu
150
24
0
26 Jun 2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
Qingpei Guo
Kaisheng Yao
Wei Chu
MLLM
45
5
0
25 Jun 2023
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu
Peixian Chen
Yunhang Shen
Yulei Qin
Mengdan Zhang
...
Xiawu Zheng
Ke Li
Xing Sun
Zhenyu Qiu
Rongrong Ji
ELM
MLLM
206
860
0
23 Jun 2023
TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter
Binjie Zhang
Yixiao Ge
Xuyuan Xu
Ying Shan
Mike Zheng Shou
103
8
0
22 Jun 2023
VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution
Elizaveta Semenova
F. G. Abrantes
Hanwen Zhu
Grace A. Sodunke
Aleksandar Shtedritski
Hannah Rose Kirk
CoGe
125
46
0
21 Jun 2023
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering
Rabiul Awal
Le Zhang
Aishwarya Agrawal
LRM
149
13
0
16 Jun 2023
Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories
Thomas Mensink
J. Uijlings
Lluis Castrejon
A. Goel
Felipe Cadar
Howard Zhou
Fei Sha
A. Araújo
V. Ferrari
96
44
0
15 Jun 2023
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Chenyang Lyu
Minghao Wu
Longyue Wang
Xinting Huang
Bingshuai Liu
Zefeng Du
Shuming Shi
Zhaopeng Tu
MLLM
AuLLM
88
173
0
15 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Qingbin Liu
VLM
CLIP
91
9
0
15 Jun 2023
Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
Isha Rawal
Alexander Matyasko
Shantanu Jaiswal
Basura Fernando
Cheston Tan
78
3
0
15 Jun 2023
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
CoGe
VLM
69
13
0
15 Jun 2023
Improving Selective Visual Question Answering by Learning from Your Peers
Corentin Dancette
Spencer Whitehead
Rishabh Maheshwary
Ramakrishna Vedantam
Stefan Scherer
Xinlei Chen
Matthieu Cord
Marcus Rohrbach
AAML
OOD
89
17
0
14 Jun 2023
Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models
Lingxi Xie
Longhui Wei
Xiaopeng Zhang
Kaifeng Bi
Xiaotao Gu
Jianlong Chang
Qi Tian
95
7
0
14 Jun 2023