ResearchTrend.AI
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
    CoGe

Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

50 / 1,977 papers shown
VLMAE: Vision-Language Masked Autoencoder
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Chen Wu
Xiujun Shu
Bohan Ren
VLM
51
11
0
19 Aug 2022
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Manuel Brack
P. Schramowski
Bjorn Deiseroth
Kristian Kersting
VLM
MLLM
36
3
0
17 Aug 2022
Aesthetic Visual Question Answering of Photographs
Xin Jin
Wu Zhou
Xinghui Zhou
Shuai Cui
Le Zhang
Jianwen Lv
Shu Zhao
CoGe
31
0
0
10 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
28
11
0
08 Aug 2022
ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding
Bingning Wang
Feiya Lv
Ting Yao
Yiming Yuan
Jin Ma
Yu Luo
Haijin Liang
36
3
0
05 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLM
LRM
VPVLM
37
30
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
41
67
0
03 Aug 2022
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
Jun Wang
M. Gao
Yuqian Hu
Ramprasaath R. Selvaraju
Chetan Ramaiah
Ran Xu
Joseph Jaja
Larry S. Davis
ViT
35
17
0
03 Aug 2022
Video Question Answering with Iterative Video-Text Co-Tokenization
A. Piergiovanni
K. Morton
Weicheng Kuo
Michael S. Ryoo
A. Angelova
39
18
0
01 Aug 2022
Generative Bias for Robust Visual Question Answering
Jae-Won Cho
Dong-Jin Kim
H. Ryu
In So Kweon
OOD
CML
44
19
0
01 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
30
1
0
31 Jul 2022
Towards Complex Document Understanding By Discrete Reasoning
Fengbin Zhu
Wenqiang Lei
Fuli Feng
Chao Wang
Haozhou Zhang
Tat-Seng Chua
45
45
0
25 Jul 2022
Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations
Qian Yang
Yunxin Li
Baotian Hu
Lin Ma
Yuxin Ding
Min Zhang
57
10
0
23 Jul 2022
Semantic-aware Modular Capsule Routing for Visual Question Answering
Yudong Han
Jianhua Yin
Jianlong Wu
Yin-wei Wei
Liqiang Nie
40
7
0
21 Jul 2022
Rethinking Data Augmentation for Robust Visual Question Answering
Long Chen
Yuhang Zheng
Jun Xiao
OOD
40
42
0
18 Jul 2022
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Qi Zhang
Yuqing Song
Qin Jin
30
24
0
18 Jul 2022
A Multibias-mitigated and Sentiment Knowledge Enriched Transformer for Debiasing in Multimodal Conversational Emotion Recognition
Jinglin Wang
Fang Ma
Yazhou Zhang
Dawei Song
21
4
0
17 Jul 2022
A Skeleton-aware Graph Convolutional Network for Human-Object Interaction Detection
Manli Zhu
Edmond S. L. Ho
Hubert P. H. Shum
3DH
42
3
0
11 Jul 2022
Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval
Keyu Wen
Zhenshan Tan
Qingrong Cheng
Cheng Chen
X. Gu
VLM
37
0
0
02 Jul 2022
Modern Question Answering Datasets and Benchmarks: A Survey
Zhen Wang
58
23
0
30 Jun 2022
EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering
Violetta Shevchenko
Ehsan Abbasnejad
A. Dick
Anton Van Den Hengel
Damien Teney
53
0
0
29 Jun 2022
Consistency-preserving Visual Question Answering in Medical Imaging
Sergio Tascon-Morales
Pablo Márquez-Neila
Raphael Sznitman
MedIm
46
12
0
27 Jun 2022
CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks
Tejas Srinivasan
Ting-Yun Chang
Leticia Pinto-Alva
Georgios Chochlakis
Mohammad Rostami
Jesse Thomason
VLM
CLL
47
73
0
18 Jun 2022
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Teng Wang
Wenhao Jiang
Zhichao Lu
Feng Zheng
Ran Cheng
Chengguo Yin
Ping Luo
VLM
48
42
0
17 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
106
395
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
51
64
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
35
85
0
16 Jun 2022
Write and Paint: Generative Vision-Language Models are Unified Modal Learners
Shizhe Diao
Wangchunshu Zhou
Xinsong Zhang
Jiawei Wang
MLLM
AI4CE
39
16
0
15 Jun 2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Chung-Ching Lin
Zicheng Liu
Ce Liu
Lijuan Wang
MLLM
VLM
33
82
0
14 Jun 2022
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
Yi-Lin Sung
Jaemin Cho
Joey Tianyi Zhou
VLM
47
238
0
13 Jun 2022
Language Models are General-Purpose Interfaces
Y. Hao
Haoyu Song
Li Dong
Shaohan Huang
Zewen Chi
Wenhui Wang
Shuming Ma
Furu Wei
MLLM
35
96
0
13 Jun 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang
Pengchuan Zhang
Xiaowei Hu
Yen-Chun Chen
Liunian Harold Li
Xiyang Dai
Lijuan Wang
Lu Yuan
Lei Li
Jianfeng Gao
ObjD
VLM
30
292
0
12 Jun 2022
A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training
Zhihao Fan
Zhongyu Wei
Jingjing Chen
Siyuan Wang
Zejun Li
Jiarong Xu
Xuanjing Huang
CLL
22
6
0
11 Jun 2022
Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
29
111
0
07 Jun 2022
cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
Kshitij Gupta
Devansh Gautam
R. Mamidi
VLM
41
3
0
07 Jun 2022
From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
Jingkuan Song
Pengpeng Zeng
Lianli Gao
Heng Tao Shen
32
62
0
04 Jun 2022
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Yujia Xie
Luowei Zhou
Xiyang Dai
Lu Yuan
Nguyen Bach
Ce Liu
Michael Zeng
VLM
MLLM
37
28
0
03 Jun 2022
Revisiting the "Video" in Video-Language Understanding
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles
46
158
0
03 Jun 2022
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
Dustin Schwenk
Apoorv Khandelwal
Christopher Clark
Kenneth Marino
Roozbeh Mottaghi
21
521
0
03 Jun 2022
VL-BEiT: Generative Vision-Language Pretraining
Hangbo Bao
Wenhui Wang
Li Dong
Furu Wei
VLM
18
45
0
02 Jun 2022
Mitigating Dataset Bias by Using Per-sample Gradient
Sumyeong Ahn
Seongyoon Kim
Se-Young Yun
54
21
0
31 May 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe
VLM
65
13
0
30 May 2022
UPB at SemEval-2022 Task 5: Enhancing UNITER with Image Sentiment and Graph Convolutional Networks for Multimedia Automatic Misogyny Identification
Andrei Paraschiv
M. Dascalu
Dumitru-Clementin Cercel
32
3
0
29 May 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
71
533
0
27 May 2022
Multimodal Knowledge Alignment with Reinforcement Learning
Youngjae Yu
Jiwan Chung
Heeseung Yun
Jack Hessel
Jinho Park
...
Prithviraj Ammanabrolu
Rowan Zellers
Ronan Le Bras
Gunhee Kim
Yejin Choi
VLM
123
36
0
25 May 2022
DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation
Jingnong Qu
Liunian Harold Li
Jieyu Zhao
Sunipa Dev
Kai-Wei Chang
26
12
0
25 May 2022
Guiding Visual Question Answering with Attention Priors
T. Le
Vuong Le
Sunil R. Gupta
Svetha Venkatesh
T. Tran
38
6
0
25 May 2022
Less Learn Shortcut: Analyzing and Mitigating Learning of Spurious Feature-Label Correlation
Yanrui Du
Jing Yang
Yan Chen
Jing Liu
Sendong Zhao
Qiaoqiao She
Huaqin Wu
Haifeng Wang
Bing Qin
69
10
0
25 May 2022
Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
Aishwarya Agrawal
Ivana Kajić
Emanuele Bugliarello
Elnaz Davoodi
Anita Gergely
Phil Blunsom
Aida Nematzadeh
OOD
45
17
0
24 May 2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li
Haiyang Xu
Junfeng Tian
Wei Wang
Ming Yan
...
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
Luo Si
VLM
MLLM
36
215
0
24 May 2022