1612.00837
v3 (latest)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,037 papers shown
Title
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
125
50
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
130
9
0
30 Mar 2023
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang
Jiaming Han
Chris Liu
Peng Gao
Aojun Zhou
Xiangfei Hu
Shilin Yan
Pan Lu
Hongsheng Li
Yu Qiao
MLLM
191
788
0
28 Mar 2023
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
A. Maharana
Amita Kamath
Christopher Clark
Joey Tianyi Zhou
Aniruddha Kembhavi
87
3
0
28 Mar 2023
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Chunpu Xu
Jing Li
VLM
62
5
0
27 Mar 2023
Curriculum Learning for Compositional Visual Reasoning
Wafa Aissa
Marin Ferecatu
M. Crucianu
LRM
87
3
0
27 Mar 2023
Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained Experts
Kastan Day
D. Christl
Rohan Salvi
Pranav Sriram
ViT
81
1
0
24 Mar 2023
Top-Down Visual Attention from Analysis by Synthesis
Baifeng Shi
Trevor Darrell
Xin Eric Wang
88
32
0
23 Mar 2023
Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering
T. M. Thai
Son T. Luu
86
0
0
22 Mar 2023
MAGVLT: Masked Generative Vision-and-Language Transformer
Sungwoong Kim
DaeJin Jo
Donghoon Lee
Jongmin Kim
VLM
60
12
0
21 Mar 2023
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Yushi Hu
Benlin Liu
Jungo Kasai
Yizhong Wang
Mari Ostendorf
Ranjay Krishna
Noah A. Smith
EGVM
95
239
0
21 Mar 2023
Large AI Models in Health Informatics: Applications, Challenges, and the Future
Jianing Qiu
Lin Li
Jiankai Sun
Jiachuan Peng
Peilun Shi
...
Bo Xiao
Wu Yuan
Ningli Wang
Dong Xu
Benny Lo
AI4MH
LM&MA
116
142
0
21 Mar 2023
eP-ALM: Efficient Perceptual Augmentation of Language Models
Mustafa Shukor
Corentin Dancette
Matthieu Cord
MLLM
VLM
74
31
0
20 Mar 2023
3D Concept Learning and Reasoning from Multi-View Images
Yining Hong
Chun-Tse Lin
Yilun Du
Zhenfang Chen
J. Tenenbaum
Chuang Gan
3DV
94
52
0
20 Mar 2023
SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage
Song Park
Sanghyuk Chun
Byeongho Heo
Wonjae Kim
Sangdoo Yun
VLM
ViT
97
8
0
20 Mar 2023
FVQA 2.0: Introducing Adversarial Samples into Fact-based Visual Question Answering
Weizhe Lin
Zhilin Wang
Bill Byrne
AAML
110
4
0
19 Mar 2023
Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning
Shi Chen
Qi Zhao
101
6
0
18 Mar 2023
Data Roaming and Quality Assessment for Composed Image Retrieval
Matan Levy
Rami Ben-Ari
N. Darshan
Dani Lischinski
105
28
0
16 Mar 2023
Logical Implications for Visual Question Answering Consistency
Sergio Tascon-Morales
Pablo Márquez-Neila
Raphael Sznitman
81
9
0
16 Mar 2023
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Shih-Han Chou
James J. Little
Leonid Sigal
74
2
0
14 Mar 2023
Vision-Language Models as Success Detectors
Yuqing Du
Ksenia Konyushkova
Misha Denil
A. Raju
Jessica Landon
Felix Hill
Nando de Freitas
Serkan Cabi
MLLM
LRM
130
86
0
13 Mar 2023
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
Nitzan Bitton-Guetta
Yonatan Bitton
Jack Hessel
Ludwig Schmidt
Yuval Elovici
Gabriel Stanovsky
Roy Schwartz
VLM
228
70
0
13 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLM
MoE
81
68
0
13 Mar 2023
ViM: Vision Middleware for Unified Downstream Transferring
Yutong Feng
Biao Gong
Jianwen Jiang
Yiliang Lv
Yujun Shen
Deli Zhao
Jingren Zhou
107
1
0
13 Mar 2023
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Deyao Zhu
Jun Chen
Kilichbek Haydarov
Xiaoqian Shen
Wenxuan Zhang
Mohamed Elhoseiny
MLLM
100
106
0
12 Mar 2023
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning
Qian Jiang
Changyou Chen
Han Zhao
Liqun Chen
Q. Ping
S. D. Tran
Yi Xu
Belinda Zeng
Trishul Chilimbi
101
43
0
10 Mar 2023
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang
Youcai Zhang
Jinyu Ma
Weiwei Tian
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Lei Zhang
CLIP
MLLM
VLM
3DV
155
77
0
10 Mar 2023
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
Lisai Zhang
Qingcai Chen
Zhijian Chen
Yunpeng Han
Zhonghua Li
Bo Zhao
VLM
61
1
0
09 Mar 2023
Toward Unsupervised Realistic Visual Question Answering
Yuwei Zhang
Chih-Hui Ho
Nuno Vasconcelos
CoGe
89
2
0
09 Mar 2023
Graph Neural Networks in Vision-Language Image Understanding: A Survey
Henry Senior
Greg Slabaugh
Shanxin Yuan
Luca Rossi
GNN
97
21
0
07 Mar 2023
PaLM-E: An Embodied Multimodal Language Model
Danny Driess
F. Xia
Mehdi S. M. Sajjadi
Corey Lynch
Aakanksha Chowdhery
...
Marc Toussaint
Klaus Greff
Andy Zeng
Igor Mordatch
Peter R. Florence
LM&Ro
166
1,679
0
06 Mar 2023
VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning
Kan Chen
Xiangqian Wu
CoGe
59
9
0
05 Mar 2023
Knowledge-Based Counterfactual Queries for Visual Question Answering
Theodoti Stoikou
Maria Lymperaiou
Giorgos Stamou
AAML
80
1
0
05 Mar 2023
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
Zhou Yu
Xuecheng Ouyang
Zhenwei Shao
Mei Wang
Jun Yu
MLLM
198
11
0
03 Mar 2023
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Jingjing Jiang
Nanning Zheng
MoE
124
6
0
02 Mar 2023
VQA with Cascade of Self- and Co-Attention Blocks
Aakansha Mishra
Ashish Anand
Prithwijit Guha
44
1
0
28 Feb 2023
Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang
Li Dong
Wenhui Wang
Y. Hao
Saksham Singhal
...
Johan Bjorck
Vishrav Chaudhary
Subhojit Som
Xia Song
Furu Wei
VLM
LRM
MLLM
148
567
0
27 Feb 2023
Medical visual question answering using joint self-supervised learning
Yuan Zhou
Jing Mei
Yiqin Yu
Tanveer Syeda-Mahmood
MedIm
52
1
0
25 Feb 2023
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Paul Pu Liang
Yun Cheng
Xiang Fan
Chun Kai Ling
Suzanne Nie
...
Nicholas B. Allen
Randy P. Auerbach
Faisal Mahmood
Ruslan Salakhutdinov
Louis-Philippe Morency
116
37
0
23 Feb 2023
Learning Visual Representations via Language-Guided Sampling
Mohamed El Banani
Karan Desai
Justin Johnson
SSL
VLM
124
28
0
23 Feb 2023
EVJVQA Challenge: Multilingual Visual Question Answering
Ngan Luu-Thuy Nguyen
Nghia Hieu Nguyen
Duong T.D. Vo
K. Tran
Kiet Van Nguyen
95
7
0
23 Feb 2023
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Yang Chen
Hexiang Hu
Yi Luan
Haitian Sun
Soravit Changpinyo
Alan Ritter
Ming-Wei Chang
152
94
0
23 Feb 2023
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Hexiang Hu
Yi Luan
Yang Chen
Urvashi Khandelwal
Mandar Joshi
Kenton Lee
Kristina Toutanova
Ming-Wei Chang
VLM
136
61
0
22 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
192
216
0
20 Feb 2023
Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning
Xinyue Hu
Lin Gu
Kazuma Kobayashi
Qi A. An
Qingyu Chen
Zhiyong Lu
Chang Su
Tatsuya Harada
Yingying Zhu
GNN
71
10
0
19 Feb 2023
Few-shot Multimodal Multitask Multilingual Learning
Aman Chadha
Vinija Jain
125
0
0
19 Feb 2023
Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering
T. Yamane
Pang-jo Chun
Jiachen Dang
Takayuki Okatani
32
0
0
18 Feb 2023
Multimodal Federated Learning via Contrastive Representation Ensemble
Qiying Yu
Yang Liu
Yimu Wang
Ke Xu
Jingjing Liu
86
90
0
17 Feb 2023
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
146
47
0
14 Feb 2023
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Haoyu Lu
Yuqi Huo
Guoxing Yang
Zhiwu Lu
Wei Zhan
Masayoshi Tomizuka
Mingyu Ding
94
36
0
13 Feb 2023