Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal, Tejas Khot, D. Summers-Stay, Dhruv Batra, Devi Parikh
arXiv:1612.00837, 2 December 2016 [CoGe]

Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering" (50 of 1,980 papers shown)

Less Learn Shortcut: Analyzing and Mitigating Learning of Spurious Feature-Label Correlation
Yanrui Du, Jing Yang, Yan Chen, Jing Liu, Sendong Zhao, Qiaoqiao She, Huaqin Wu, Haifeng Wang, Bing Qin
25 May 2022

Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
Aishwarya Agrawal, Ivana Kajić, Emanuele Bugliarello, Elnaz Davoodi, Anita Gergely, Phil Blunsom, Aida Nematzadeh
24 May 2022 [OOD]

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, ..., Ji Zhang, Songfang Huang, Feiran Huang, Jingren Zhou, Luo Si
24 May 2022 [VLM, MLLM]

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, A. Black, Ana Marasović
24 May 2022

Training Vision-Language Transformers from Captions
Liangke Gui, Yingshan Chang, Qiuyuan Huang, Subhojit Som, Alexander G. Hauptmann, Jianfeng Gao, Yonatan Bisk
19 May 2022 [VLM, ViT]

Gender and Racial Bias in Visual Question Answering Datasets
Yusuke Hirota, Yuta Nakashima, Noa Garcia
17 May 2022 [FaML]

What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge
Lovisa Hagström, Richard Johansson
14 May 2022 [VLM]

Learning to Answer Visual Questions from Web Videos
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
10 May 2022 [ViT]

What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning
Jae Hee Lee, Matthias Kerzel, Kyra Ahrens, C. Weber, S. Wermter
5 May 2022

CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
4 May 2022 [VLM, CLIP, OffRL]

All You May Need for VQA are Image Captions
Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut
4 May 2022

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
A. Piergiovanni, Wei Li, Weicheng Kuo, M. Saffar, Fred Bertsch, A. Angelova
2 May 2022

Visual Spatial Reasoning
Fangyu Liu, Guy Edward Toh Emerson, Nigel Collier
30 Apr 2022 [ReLM]

SHAPE: An Unified Approach to Evaluate the Contribution and Cooperation of Individual Modalities
Pengbo Hu, Xingyu Li, Yi Zhou
30 Apr 2022

GRIT: General Robust Image Task Benchmark
Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, Derek Hoiem
28 Apr 2022 [VLM, OOD, ObjD]

Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
28 Apr 2022

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations
Roy Schwartz, Gabriel Stanovsky
27 Apr 2022

Training and challenging models for text-guided fashion image retrieval
Eric Dodds, Jack Culpepper, Gaurav Srivastava
23 Apr 2022

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, ..., Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan
22 Apr 2022 [VLM, OffRL]

Attention in Reasoning: Dataset, Analysis, and Modeling
Shi Chen, Ming Jiang, Jinhui Yang, Qi Zhao
20 Apr 2022 [LRM]

Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu, Feiyue Huang, Rongrong Ji
16 Apr 2022 [ViT]

It is Okay to Not Be Okay: Overcoming Emotional Bias in Affective Image Captioning by Contrastive Data Collection
Youssef Mohamed, Faizan Farooq Khan, Kilichbek Haydarov, Mohamed Elhoseiny
15 Apr 2022

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross
7 Apr 2022 [CoGe]

CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations
Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
5 Apr 2022 [LRM, NAI]

SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering
Vipul Gupta, Zhuowan Li, Adam Kortylewski, Chenyu Zhang, Yingwei Li, Alan Yuille
5 Apr 2022

On Explaining Multimodal Hateful Meme Detection Models
Ming Shan Hee, Roy Ka-wei Lee, Wen-Haw Chong
4 Apr 2022 [VLM]

Question-Driven Graph Fusion Network For Visual Question Answering
Yuxi Qian, Yuncong Hu, Ruonan Wang, Fangxiang Feng, Xiaojie Wang
3 Apr 2022 [GNN]

Co-VQA: Answering by Interactive Sub Question Sequence
Ruonan Wang, Yuxi Qian, Fangxiang Feng, Xiaojie Wang, Huixing Jiang
2 Apr 2022 [LRM]

SimVQA: Exploring Simulated Environments for Visual Question Answering
Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio Feris, Vicente Ordonez
31 Mar 2022

To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo
Yiran Luo, Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral
30 Mar 2022

Image Retrieval from Contextual Descriptions
Benno Krojer, Vaibhav Adlakha, Vibhav Vineet, Yash Goyal, Edoardo Ponti, Siva Reddy
29 Mar 2022

Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu
26 Mar 2022

Multi-modal Misinformation Detection: Approaches, Challenges and Opportunities
S. Abdali, Sina Shaham, Bhaskar Krishnamachari
25 Mar 2022

Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering
Chengyang Fang, Gangyan Zeng, Yu Zhou, Daiqing Wu, Can Ma, Dayong Hu, Weiping Wang
24 Mar 2022

Multilingual CheckList: Generation and Evaluation
Karthikeyan K, Shaily Bhatt, Pankaj Singh, Somak Aditya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
24 Mar 2022 [ELM]

Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
Zhou Yu, Zitian Jin, Jun Yu, Mingliang Xu, Hongbo Wang, Jianping Fan
24 Mar 2022

Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)
Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, Longbo Huang
23 Mar 2022

Finding Structural Knowledge in Multimodal-BERT
Victor Milewski, Miryam de Lhoneux, Marie-Francine Moens
17 Mar 2022

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering
Yang Ding, Jing Yu, Bangchang Liu, Yue Hu, Mingxin Cui, Qi Wu
17 Mar 2022

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang
17 Mar 2022 [MLLM]

Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene
Duo Zheng, Fandong Meng, Q. Si, Hairun Fan, Zipeng Xu, Jie Zhou, Fangxiang Feng, Xiaojie Wang
16 Mar 2022

Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
Xiao Liu, Da Yin, Yansong Feng, Dongyan Zhao
15 Mar 2022 [LRM]

Modular and Parameter-Efficient Multimodal Fusion with Prompting
Sheng Liang, Mengjie Zhao, Hinrich Schütze
15 Mar 2022

K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition
Kohei Uehara, Tatsuya Harada
15 Mar 2022

Can you even tell left from right? Presenting a new challenge for VQA
Sairaam Venkatraman, Rishi Rao, S. Balasubramanian, C. Vorugunti, R. R. Sarma
15 Mar 2022 [CoGe]

CARETS: A Consistency And Robustness Evaluative Test Suite for VQA
Carlos E. Jimenez, Olga Russakovsky, Karthik Narasimhan
15 Mar 2022 [CoGe]

Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer
Woojeong Jin, Dong-Ho Lee, Chenguang Zhu, Jay Pujara, Xiang Ren
14 Mar 2022 [CLIP, VLM]

CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
Haoyu Song, Li Dong, Weinan Zhang, Ting Liu, Furu Wei
14 Mar 2022 [VLM, CLIP]

The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning
Jessica Hullman, Sayash Kapoor, Priyanka Nanayakkara, Andrew Gelman, Arvind Narayanan
12 Mar 2022

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung
12 Mar 2022 [VLM]