Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1612.00837
Cited By
v1
v2
v3 (latest)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,037 papers shown
Title
Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene
Duo Zheng
Fandong Meng
Q. Si
Hairun Fan
Zipeng Xu
Jie Zhou
Fangxiang Feng
Xiaojie Wang
80
0
0
16 Mar 2022
Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
Xiao Liu
Da Yin
Yansong Feng
Dongyan Zhao
LRM
80
46
0
15 Mar 2022
Modular and Parameter-Efficient Multimodal Fusion with Prompting
Sheng Liang
Mengjie Zhao
Hinrich Schütze
98
45
0
15 Mar 2022
K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition
Kohei Uehara
Tatsuya Harada
98
10
0
15 Mar 2022
Can you even tell left from right? Presenting a new challenge for VQA
Sairaam Venkatraman
Rishi Rao
S. Balasubramanian
C. Vorugunti
R. R. Sarma
CoGe
86
0
0
15 Mar 2022
CARETS: A Consistency And Robustness Evaluative Test Suite for VQA
Carlos E. Jimenez
Olga Russakovsky
Karthik Narasimhan
CoGe
84
14
0
15 Mar 2022
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer
Woojeong Jin
Dong-Ho Lee
Chenguang Zhu
Jay Pujara
Xiang Ren
CLIP
VLM
75
10
0
14 Mar 2022
CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment
Haoyu Song
Li Dong
Weinan Zhang
Ting Liu
Furu Wei
VLM
CLIP
108
139
0
14 Mar 2022
The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning
Jessica Hullman
Sayash Kapoor
Priyanka Nanayakkara
Andrew Gelman
Arvind Narayanan
147
39
0
12 Mar 2022
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
Wenliang Dai
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
Pascale Fung
VLM
104
94
0
12 Mar 2022
REX: Reasoning-aware and Grounded Explanation
Shi Chen
Qi Zhao
93
18
0
11 Mar 2022
PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks
Nan Ding
Xi Chen
Tomer Levinboim
Soravit Changpinyo
Radu Soricut
86
30
0
10 Mar 2022
Mapping global dynamics of benchmark creation and saturation in artificial intelligence
Simon Ott
A. Barbosa-Silva
Kathrin Blagec
J. Brauner
Matthias Samwald
113
40
0
09 Mar 2022
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant
B. Wong
Joya Chen
You Wu
Stan Weixian Lei
Dongxing Mao
Difei Gao
Mike Zheng Shou
EgoV
77
29
0
08 Mar 2022
HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks
Zhengkun Zhang
Wenya Guo
Xiaojun Meng
Yasheng Wang
Yadao Wang
Xin Jiang
Qun Liu
Zhenglu Yang
80
17
0
08 Mar 2022
Image Search with Text Feedback by Additive Attention Compositional Learning
Yuxin Tian
Shawn D. Newsam
K. Boakye
CoGe
70
13
0
08 Mar 2022
DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations
Yiwei Lyu
Paul Pu Liang
Zihao Deng
Ruslan Salakhutdinov
Louis-Philippe Morency
97
36
0
03 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
83
37
0
03 Mar 2022
Multi-modal Alignment using Representation Codebook
Jiali Duan
Liqun Chen
Son Tran
Jinyu Yang
Yi Xu
Belinda Zeng
Trishul Chilimbi
131
68
0
28 Feb 2022
On Modality Bias Recognition and Reduction
Yangyang Guo
Liqiang Nie
Harry Cheng
Zhiyong Cheng
Mohan S. Kankanhalli
A. Bimbo
80
28
0
25 Feb 2022
Vision-Language Pre-Training with Triple Contrastive Learning
Jinyu Yang
Jiali Duan
Son N. Tran
Yi Xu
Sampath Chanda
Liqun Chen
Belinda Zeng
Trishul Chilimbi
Junzhou Huang
VLM
140
300
0
21 Feb 2022
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
186
228
0
18 Feb 2022
When Did It Happen? Duration-informed Temporal Localization of Narrated Actions in Vlogs
Oana Ignat
Santiago Castro
Yuhang Zhou
Jiajun Bao
Dandan Shan
Rada Mihalcea
59
3
0
16 Feb 2022
Saving Dense Retriever from Shortcut Dependency in Conversational Search
Sungdong Kim
Gangwoo Kim
88
27
0
15 Feb 2022
An experimental study of the vision-bottleneck in VQA
Pierre Marza
Corentin Kervadec
G. Antipov
M. Baccouche
Christian Wolf
95
1
0
14 Feb 2022
Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Jiaxi Gu
Xiaojun Meng
Guansong Lu
Lu Hou
Minzhe Niu
...
Runhu Huang
Wei Zhang
Xingda Jiang
Chunjing Xu
Hang Xu
VLM
187
95
0
14 Feb 2022
Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
Nan Wu
Stanislaw Jastrzebski
Kyunghyun Cho
Krzysztof J. Geras
77
76
0
10 Feb 2022
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
Jaemin Cho
Abhaysinh Zala
Joey Tianyi Zhou
ViT
258
193
0
08 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
268
884
0
07 Feb 2022
Catch Me if You Can: A Novel Task for Detection of Covert Geo-Locations (CGL)
Binoy Saha
Sukhendu Das
75
1
0
05 Feb 2022
Webly Supervised Concept Expansion for General Purpose Vision Models
Amita Kamath
Christopher Clark
Tanmay Gupta
Eric Kolve
Derek Hoiem
Aniruddha Kembhavi
VLM
97
55
0
04 Feb 2022
Grounding Answers for Visual Questions Asked by Visually Impaired People
Chongyan Chen
Samreen Anjum
Danna Gurari
111
49
0
04 Feb 2022
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning
Zejun Li
Zhihao Fan
Huaixiao Tou
Jingjing Chen
Zhongyu Wei
Xuanjing Huang
88
18
0
29 Jan 2022
Rethinking Attention-Model Explainability through Faithfulness Violation Test
Yebin Liu
Haoliang Li
Yangyang Guo
Chen Kong
Jing Li
Shiqi Wang
FAtt
183
43
0
28 Jan 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM
BDL
VLM
CLIP
591
4,444
0
28 Jan 2022
Recursive Decoding: A Situated Cognition Approach to Compositional Generation in Grounded Language Understanding
Matthew Setzler
Scott Howland
Lauren A. Phillips
LRM
69
5
0
27 Jan 2022
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Peixi Xiong
Yilin Shen
Hongxia Jin
37
5
0
25 Jan 2022
Do Smart Glasses Dream of Sentimental Visions? Deep Emotionship Analysis for Eyewear Devices
Yingying Zhao
Yuhu Chang
Yutian Lu
Yujiang Wang
Mingzhi Dong
...
Robert P. Dick
Fan Yang
Tun Lu
Ning Gu
L. Shang
78
10
0
24 Jan 2022
Question Generation for Evaluating Cross-Dataset Shifts in Multi-modal Grounding
Arjun Reddy Akula
OOD
116
3
0
24 Jan 2022
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
Zhecan Wang
Noel Codella
Yen-Chun Chen
Luowei Zhou
Jianwei Yang
Xiyang Dai
Bin Xiao
Haoxuan You
Shih-Fu Chang
Lu Yuan
CLIP
VLM
83
40
0
15 Jan 2022
Towards Automated Error Analysis: Learning to Characterize Errors
Tong Gao
Shivang Singh
Raymond J. Mooney
68
1
0
13 Jan 2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
Ankur Sikarwar
Gabriel Kreiman
ViT
51
1
0
11 Jan 2022
COIN: Counterfactual Image Generation for VQA Interpretation
Zeyd Boukhers
Timo Hartmann
Jan Jurjens
49
7
0
10 Jan 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
135
215
0
07 Jan 2022
Self-Training Vision Language BERTs with a Unified Conditional Model
Xiaofeng Yang
Fengmao Lv
Fayao Liu
Guosheng Lin
SSL
VLM
89
14
0
06 Jan 2022
VisRecall: Quantifying Information Visualisation Recallability via Question Answering
Yao Wang
Chuhan Jiao
Mihai Bâce
Andreas Bulling
148
5
0
30 Dec 2021
Multi-Image Visual Question Answering
Harsh Raj
Janhavi Dadhania
Akhilesh Bhardwaj
Prabuchandran KJ
47
2
0
27 Dec 2021
LaTr: Layout-Aware Transformer for Scene-Text VQA
Ali Furkan Biten
Ron Litman
Yusheng Xie
Srikar Appalaraju
R. Manmatha
ViT
135
102
0
23 Dec 2021
Understanding and Measuring Robustness of Multimodal Learning
Nishant Vishwamitra
Hongxin Hu
Ziming Zhao
Long Cheng
Feng Luo
AAML
86
5
0
22 Dec 2021
Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation
Xu Yan
Zhihao Yuan
Yuhao Du
Yinghong Liao
Yao Guo
Zhen Li
Shuguang Cui
3DPC
CoGe
67
17
0
22 Dec 2021
Previous
1
2
3
...
28
29
30
...
39
40
41
Next