Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1612.00837
Cited By
v1
v2
v3 (latest)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,037 papers shown
Title
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen
Tianlin Li
Soravit Changpinyo
A. Piergiovanni
Piotr Padlewski
...
Andreas Steiner
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
MLLM
VLM
253
742
0
14 Sep 2022
UIT-ViCoV19QA: A Dataset for COVID-19 Community-based Question Answering on Vietnamese Language
T. M. Thai
Ngan Ha-Thao Chu
A. T. Vo
Son T. Luu
69
3
0
14 Sep 2022
ImageArg: A Multi-modal Tweet Dataset for Image Persuasiveness Mining
Zhexiong Liu
M. Guo
Y. Dai
Diane Litman
85
16
0
14 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
193
29
0
12 Sep 2022
MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo
Linting Xue
Michal Yarom
Ashish V. Thapliyal
Idan Szpektor
J. Amelot
Xi Chen
Radu Soricut
120
8
0
12 Sep 2022
DECK: Behavioral Tests to Improve Interpretability and Generalizability of BERT Models Detecting Depression from Text
Jekaterina Novikova
Ksenia Shkaruta
AI4MH
85
4
0
12 Sep 2022
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
114
90
0
07 Sep 2022
Interactive Question Answering Systems: Literature Review
Giovanni Maria Biancofiore
Yashar Deldjoo
Tommaso Di Noia
E. Sciascio
Fedelucio Narducci
111
23
0
04 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
100
27
0
29 Aug 2022
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
Stan Weixian Lei
Difei Gao
Jay Zhangjie Wu
Yuxuan Wang
Wei Liu
Meng Zhang
Mike Zheng Shou
81
38
0
24 Aug 2022
Bidirectional Contrastive Split Learning for Visual Question Answering
Yuwei Sun
H. Ochiai
39
2
0
24 Aug 2022
FashionVQA: A Domain-Specific Visual Question Answering System
Min Wang
A. Mahjoubfar
Anupama Joshi
106
4
0
24 Aug 2022
Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Tianwei Chen
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Hajime Nagahara
VLM
56
0
0
23 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
175
647
0
22 Aug 2022
VLMAE: Vision-Language Masked Autoencoder
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Chen Wu
Xiujun Shu
Bohan Ren
VLM
92
11
0
19 Aug 2022
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Manuel Brack
P. Schramowski
Bjorn Deiseroth
Kristian Kersting
VLM
MLLM
54
3
0
17 Aug 2022
Aesthetic Visual Question Answering of Photographs
Xin Jin
Wu Zhou
Xinghui Zhou
Shuai Cui
Le Zhang
Jianwen Lv
Shu Zhao
CoGe
45
0
0
10 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
95
11
0
08 Aug 2022
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Bingning Wang
Feiya Lv
Ting Yao
Yiming Yuan
Jin Ma
Yu Luo
Haijin Liang
68
3
0
05 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLM
LRM
VPVLM
91
31
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
92
68
0
03 Aug 2022
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
Jun Wang
M. Gao
Yuqian Hu
Ramprasaath R. Selvaraju
Chetan Ramaiah
Ran Xu
Joseph Jaja
Larry S. Davis
ViT
72
18
0
03 Aug 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
104
18
0
01 Aug 2022
Generative Bias for Robust Visual Question Answering
Jae-Won Cho
Dong-Jin Kim
H. Ryu
In So Kweon
OOD
CML
109
20
0
01 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
53
1
0
31 Jul 2022
Towards Complex Document Understanding By Discrete Reasoning
Fengbin Zhu
Wenqiang Lei
Fuli Feng
Chao Wang
Haozhou Zhang
Tat-Seng Chua
122
48
0
25 Jul 2022
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations
Qian Yang
Yunxin Li
Baotian Hu
Lin Ma
Yuxin Ding
Min Zhang
95
10
0
23 Jul 2022
Semantic-aware Modular Capsule Routing for Visual Question Answering
Yudong Han
Jianhua Yin
Jianlong Wu
Yin-wei Wei
Liqiang Nie
68
8
0
21 Jul 2022
Rethinking Data Augmentation for Robust Visual Question Answering
Long Chen
Yuhang Zheng
Jun Xiao
OOD
88
43
0
18 Jul 2022
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Qi Zhang
Yuqing Song
Qin Jin
85
26
0
18 Jul 2022
A Multibias-mitigated and Sentiment Knowledge Enriched Transformer for Debiasing in Multimodal Conversational Emotion Recognition
Jinglin Wang
Fang Ma
Yazhou Zhang
Dawei Song
44
4
0
17 Jul 2022
A Skeleton-aware Graph Convolutional Network for Human-Object Interaction Detection
Manli Zhu
Edmond S. L. Ho
Hubert P. H. Shum
3DH
63
3
0
11 Jul 2022
Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval
Keyu Wen
Zhenshan Tan
Qingrong Cheng
Cheng Chen
X. Gu
VLM
80
0
0
02 Jul 2022
Modern Question Answering Datasets and Benchmarks: A Survey
Zhen Wang
85
24
0
30 Jun 2022
EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering
Violetta Shevchenko
Ehsan Abbasnejad
A. Dick
Anton Van Den Hengel
Damien Teney
75
0
0
29 Jun 2022
Consistency-preserving Visual Question Answering in Medical Imaging
Sergio Tascon-Morales
Pablo Márquez-Neila
Raphael Sznitman
MedIm
92
12
0
27 Jun 2022
CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks
Tejas Srinivasan
Ting-Yun Chang
Leticia Pinto-Alva
Georgios Chochlakis
Mohammad Rostami
Jesse Thomason
VLM
CLL
107
76
0
18 Jun 2022
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Teng Wang
Wenhao Jiang
Zhichao Lu
Feng Zheng
Ran Cheng
Chengguo Yin
Ping Luo
VLM
91
44
0
17 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
173
412
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
103
69
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
127
90
0
16 Jun 2022
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
Shizhe Diao
Wangchunshu Zhou
Xinsong Zhang
Jiawei Wang
MLLM
AI4CE
99
17
0
15 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
92
84
0
14 Jun 2022
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
Yi-Lin Sung
Jaemin Cho
Joey Tianyi Zhou
VLM
111
246
0
13 Jun 2022
Language Models are General-Purpose Interfaces
Y. Hao
Haoyu Song
Li Dong
Shaohan Huang
Zewen Chi
Wenhui Wang
Shuming Ma
Furu Wei
MLLM
80
102
0
13 Jun 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang
Pengchuan Zhang
Xiaowei Hu
Yen-Chun Chen
Liunian Harold Li
Xiyang Dai
Lijuan Wang
Lu Yuan
Lei Li
Jianfeng Gao
ObjD
VLM
106
304
0
12 Jun 2022
A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training
Zhihao Fan
Zhongyu Wei
Jingjing Chen
Siyuan Wang
Zejun Li
Jiarong Xu
Xuanjing Huang
CLL
59
6
0
11 Jun 2022
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
96
115
0
07 Jun 2022
cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
Kshitij Gupta
Devansh Gautam
R. Mamidi
VLM
81
4
0
07 Jun 2022
From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
Jingkuan Song
Pengpeng Zeng
Lianli Gao
Heng Tao Shen
72
62
0
04 Jun 2022
Previous
1
2
3
...
26
27
28
...
39
40
41
Next