1612.00837
Cited By
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 1,975 papers shown
Title
Detecting and analyzing missing citations to published scientific entities
Jialiang Lin
Yao Yu
Jia-Qi Song
X. Shi
24
4
0
18 Oct 2022
Meta-Learning via Classifier(-free) Diffusion Guidance
Elvis Nava
Seijin Kobayashi
Yifei Yin
Robert K. Katzschmann
Benjamin Grewe
VLM
32
6
0
17 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Xuran Pan
Tianzhu Ye
Dongchen Han
S. Song
Gao Huang
VLM
CLIP
42
47
0
17 Oct 2022
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
A. M. H. Tiong
Junnan Li
Boyang Albert Li
Silvio Savarese
Steven C.H. Hoi
MLLM
42
103
0
17 Oct 2022
Scratching Visual Transformer's Back with Uniform Attention
Nam Hyeon-Woo
Kim Yu-Ji
Byeongho Heo
Dongyoon Han
Seong Joon Oh
Tae-Hyun Oh
393
23
0
16 Oct 2022
Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy
Shiyuan Huang
Robinson Piramuthu
Shih-Fu Chang
Gunnar Sigurdsson
36
1
0
15 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
41
37
0
14 Oct 2022
SQA3D: Situated Question Answering in 3D Scenes
Xiaojian Ma
Silong Yong
Zilong Zheng
Qing Li
Yitao Liang
Song-Chun Zhu
Siyuan Huang
LM&Ro
51
134
0
14 Oct 2022
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Manas
Pau Rodríguez López
Saba Ahmadi
Aida Nematzadeh
Yash Goyal
Aishwarya Agrawal
VLM
VPVLM
21
48
0
13 Oct 2022
Foundation Transformers
Hongyu Wang
Shuming Ma
Shaohan Huang
Li Dong
Wenhui Wang
...
Barun Patra
Zhun Liu
Vishrav Chaudhary
Xia Song
Furu Wei
AI4CE
61
27
0
12 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
33
1
0
12 Oct 2022
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji
Junjie Wang
Yuan Gong
Lin Zhang
Yan Zhu
Hongfa Wang
Jiaxing Zhang
Tetsuya Sakai
Yujiu Yang
MLLM
33
29
0
11 Oct 2022
Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA
Q. Si
Fandong Meng
Mingyu Zheng
Zheng Lin
Yuanxin Liu
Peng Fu
Yanan Cao
Weiping Wang
Jie Zhou
40
21
0
10 Oct 2022
Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning
Q. Si
Yuanxin Liu
Fandong Meng
Zheng Lin
Peng Fu
Yanan Cao
Weiping Wang
Jie Zhou
53
23
0
10 Oct 2022
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Zijia Zhao
Longteng Guo
Xingjian He
Shuai Shao
Zehuan Yuan
Jing Liu
26
9
0
09 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
29
2
0
08 Oct 2022
EgoTaskQA: Understanding Human Tasks in Egocentric Videos
Baoxiong Jia
Ting Lei
Song-Chun Zhu
Siyuan Huang
EgoV
42
62
0
08 Oct 2022
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee
Mandar Joshi
Iulia Turc
Hexiang Hu
Fangyu Liu
Julian Martin Eisenschlos
Urvashi Khandelwal
Peter Shaw
Ming-Wei Chang
Kristina Toutanova
CLIP
VLM
169
270
0
07 Oct 2022
Vision+X: A Survey on Multimodal Learning in the Light of Data
Ye Zhu
Yuehua Wu
N. Sebe
Yan Yan
55
16
0
05 Oct 2022
Extending Compositional Attention Networks for Social Reasoning in Videos
Christina Sartzetaki
Georgios Paraskevopoulos
Alexandros Potamianos
LRM
31
3
0
03 Oct 2022
Music-to-Text Synaesthesia: Generating Descriptive Text from Music Recordings
Zhihuan Kuang
Shi Zong
Jianbing Zhang
Jiajun Chen
Hongfu Liu
38
4
0
02 Oct 2022
Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering
Mavina Nikandrou
Lu Yu
Alessandro Suglia
Ioannis Konstas
Verena Rieser
OOD
32
5
0
30 Sep 2022
Linearly Mapping from Image to Text Space
Jack Merullo
Louis Castricato
Carsten Eickhoff
Ellie Pavlick
VLM
175
107
0
30 Sep 2022
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
66
28
0
28 Sep 2022
Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding
Fengyuan Shi
Ruopeng Gao
Weilin Huang
Limin Wang
35
25
0
28 Sep 2022
A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective
Chaoqi Chen
Yushuang Wu
Qiyuan Dai
Hong-Yu Zhou
Mutian Xu
Sibei Yang
Xiaoguang Han
Yizhou Yu
ViT
MedIm
AI4CE
32
74
0
27 Sep 2022
Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline
Lichen Zhao
Daigang Cai
Jing Zhang
Lu Sheng
Dong Xu
Ruizhi Zheng
Yinjie Zhao
Lipeng Wang
Xibo Fan
19
24
0
24 Sep 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
Ashwin Kalyan
ELM
ReLM
LRM
213
1,155
0
20 Sep 2022
How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?
Lovisa Hagström
Richard Johansson
VLM
43
4
0
19 Sep 2022
Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances
Yike Wu
Yu Zhao
Shiwan Zhao
Ying Zhang
Xiaojie Yuan
Guoqing Zhao
Ning Jiang
90
17
0
18 Sep 2022
Studying the explanations for the automated prediction of bug and non-bug issues using LIME and SHAP
Benjamin Ledel
Steffen Herbold
FAtt
43
4
0
15 Sep 2022
LAVIS: A Library for Language-Vision Intelligence
Dongxu Li
Junnan Li
Hung Le
Guangsen Wang
Silvio Savarese
Steven C.H. Hoi
VLM
134
54
0
15 Sep 2022
Distribution Aware Metrics for Conditional Natural Language Generation
David M. Chan
Yiming Ni
David A. Ross
Sudheendra Vijayanarasimhan
Austin Myers
John F. Canny
53
4
0
15 Sep 2022
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Nanning Zheng
31
8
0
14 Sep 2022
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen
Xiao Wang
Soravit Changpinyo
A. Piergiovanni
Piotr Padlewski
...
Andreas Steiner
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
MLLM
VLM
42
691
0
14 Sep 2022
UIT-ViCoV19QA: A Dataset for COVID-19 Community-based Question Answering on Vietnamese Language
T. M. Thai
Ngan Ha-Thao Chu
A. T. Vo
Son T. Luu
39
3
0
14 Sep 2022
ImageArg: A Multi-modal Tweet Dataset for Image Persuasiveness Mining
Zhexiong Liu
M. Guo
Y. Dai
Diane Litman
37
15
0
14 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
151
29
0
12 Sep 2022
MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo
Linting Xue
Michal Yarom
Ashish V. Thapliyal
Idan Szpektor
J. Amelot
Xi Chen
Radu Soricut
38
8
0
12 Sep 2022
DECK: Behavioral Tests to Improve Interpretability and Generalizability of BERT Models Detecting Depression from Text
Jekaterina Novikova
Ksenia Shkaruta
AI4MH
42
4
0
12 Sep 2022
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
26
65
0
07 Sep 2022
Interactive Question Answering Systems: Literature Review
Giovanni Maria Biancofiore
Yashar Deldjoo
Tommaso Di Noia
E. Sciascio
Fedelucio Narducci
51
14
0
04 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
31
27
0
29 Aug 2022
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
Stan Weixian Lei
Difei Gao
Jay Zhangjie Wu
Yuxuan Wang
Wei Liu
Meng Zhang
Mike Zheng Shou
30
36
0
24 Aug 2022
Bidirectional Contrastive Split Learning for Visual Question Answering
Yuwei Sun
H. Ochiai
24
2
0
24 Aug 2022
FashionVQA: A Domain-Specific Visual Question Answering System
Min Wang
A. Mahjoubfar
Anupama Joshi
34
4
0
24 Aug 2022
Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Tianwei Chen
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Hajime Nagahara
VLM
43
0
0
23 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
89
632
0
22 Aug 2022
VLMAE: Vision-Language Masked Autoencoder
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Chen Wu
Xiujun Shu
Bohan Ren
VLM
51
11
0
19 Aug 2022
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Manuel Brack
P. Schramowski
Bjorn Deiseroth
Kristian Kersting
VLM
MLLM
36
3
0
17 Aug 2022