ResearchTrend.AI
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
    CoGe

Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

50 / 2,037 papers shown
MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
Revanth Reddy Gangi Reddy
Xilin Rui
Manling Li
Xudong Lin
Haoyang Wen
...
Joey Tianyi Zhou
Avirup Sil
Shih-Fu Chang
Alex Schwing
Heng Ji
82
32
0
20 Dec 2021
General Greedy De-bias Learning
Xinzhe Han
Shuhui Wang
Chi Su
Qingming Huang
Qi Tian
119
9
0
20 Dec 2021
ScanQA: 3D Question Answering for Spatial Scene Understanding
Daich Azuma
Taiki Miyanishi
Shuhei Kurita
M. Kawanabe
115
208
0
20 Dec 2021
Contrastive Vision-Language Pre-training with Limited Resources
Quan Cui
Boyan Zhou
Yu Guo
Weidong Yin
Hao Wu
Osamu Yoshie
Yubo Chen
VLM, CLIP
53
34
0
17 Dec 2021
Distilled Dual-Encoder Model for Vision-Language Understanding
Zekun Wang
Wenhui Wang
Haichao Zhu
Ming Liu
Bing Qin
Furu Wei
VLM, FedML
92
33
0
16 Dec 2021
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
Zhecan Wang
Haoxuan You
Liunian Harold Li
Alireza Zareian
Suji Park
Yiqing Liang
Kai-Wei Chang
Shih-Fu Chang
ReLM, LRM
69
33
0
16 Dec 2021
3D Question Answering
Shuquan Ye
Dongdong Chen
Songfang Han
Jing Liao
ViT
94
49
0
15 Dec 2021
Dual-Key Multimodal Backdoors for Visual Question Answering
Matthew Walmer
Karan Sikka
Indranil Sur
Abhinav Shrivastava
Susmit Jha
AAML
78
38
0
14 Dec 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu
Michele Cafagna
Lilitta Muradjan
Anette Frank
Iacer Calixto
Albert Gatt
CoGe
110
118
0
14 Dec 2021
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering
Jianjian Cao
Xiameng Qin
Sanyuan Zhao
Jianbing Shen
77
21
0
14 Dec 2021
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
Yi-Lin Sung
Jaemin Cho
Joey Tianyi Zhou
VLM, VPVLM
130
360
0
13 Dec 2021
Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
Tianyi Liu
Zuxuan Wu
Wenhan Xiong
Jingjing Chen
Yu-Gang Jiang
VLM, MLLM
88
10
0
10 Dec 2021
PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning
Yining Hong
Li Yi
J. Tenenbaum
Antonio Torralba
Chuang Gan
76
40
0
09 Dec 2021
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh
Ronghang Hu
Vedanuj Goswami
Guillaume Couairon
Wojciech Galuba
Marcus Rohrbach
Douwe Kiela
CLIP, VLM
159
719
0
08 Dec 2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yi-Liang Nie
Linjie Li
Zhe Gan
Shuohang Wang
Chenguang Zhu
Michael Zeng
Zicheng Liu
Joey Tianyi Zhou
Lijuan Wang
66
6
0
08 Dec 2021
Joint Learning of Localized Representations from Medical Images and Reports
Philipp Muller
Georgios Kaissis
Cong Zou
Daniel Munich
225
87
0
06 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Hongsheng Li
Xiaohua Wang
Jifeng Dai
132
133
0
02 Dec 2021
Querying Labelled Data with Scenario Programs for Sim-to-Real Validation
Edward Kim
Jay Shenoy
Sebastian Junges
Daniel J. Fremont
Alberto L. Sangiovanni-Vincentelli
Sanjit A. Seshia
81
3
0
01 Dec 2021
Classification-Regression for Chart Comprehension
Matan Levy
Rami Ben-Ari
Dani Lischinski
67
16
0
29 Nov 2021
Two-stage Rule-induction Visual Reasoning on RPMs with an Application to Video Prediction
Wentao He
Jianfeng Ren
Ruibin Bai
Xudong Jiang
LRM
70
5
0
24 Nov 2021
Florence: A New Foundation Model for Computer Vision
Lu Yuan
Dongdong Chen
Yi-Ling Chen
Noel Codella
Xiyang Dai
...
Zhen Xiao
Jianwei Yang
Michael Zeng
Luowei Zhou
Pengchuan Zhang
VLM
213
907
0
22 Nov 2021
Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture
Daria Bakshandaeva
Denis Dimitrov
V.Ya. Arkhipkin
Alex Shonenkov
M. Potanin
...
Mikhail Martynov
Anton Voronov
Vera Davydova
E. Tutubalina
Aleksandr Petiushko
110
0
0
22 Nov 2021
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning
Keng Ji Chow
Samson Tan
MingSung Kan
LRM
65
4
0
21 Nov 2021
Medical Visual Question Answering: A Survey
Zhihong Lin
Donghao Zhang
Qingyi Tao
Danli Shi
Gholamreza Haffari
Qi Wu
M. He
Z. Ge
116
122
0
19 Nov 2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang
Xiaowei Hu
Zhe Gan
Zhengyuan Yang
Xiyang Dai
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
78
57
0
19 Nov 2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Yan Zeng
Xinsong Zhang
Hang Li
VLM, CLIP
106
308
0
16 Nov 2021
LiT: Zero-Shot Transfer with Locked-image text Tuning
Xiaohua Zhai
Tianlin Li
Basil Mustafa
Andreas Steiner
Daniel Keysers
Alexander Kolesnikov
Lucas Beyer
VLM
200
561
0
15 Nov 2021
Sentiment Analysis of Fashion Related Posts in Social Media
Yifei Yuan
W. Lam
75
8
0
15 Nov 2021
A Survey of Visual Transformers
Yang Liu
Yao Zhang
Yixin Wang
Feng Hou
Jin Yuan
Jiang Tian
Yang Zhang
Zhongchao Shi
Jianping Fan
Zhiqiang He
3DGS, ViT
207
356
0
11 Nov 2021
Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture
Michael Yang
Aditya Anantharaman
Zach Kitowski
Derik Clive Robert
ViT
64
4
0
11 Nov 2021
Towards Debiasing Temporal Sentence Grounding in Video
Hao Zhang
Aixin Sun
Wei Jing
Qiufeng Wang
103
16
0
08 Nov 2021
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao
Wenhui Wang
Li Dong
Qiang Liu
Owais Khan Mohammed
Kriti Aggarwal
Subhojit Som
Furu Wei
VLM, MLLM, MoE
106
560
0
03 Nov 2021
Introspective Distillation for Robust Question Answering
Yulei Niu
Hanwang Zhang
94
60
0
01 Nov 2021
Perceptual Score: What Data Modalities Does Your Model Perceive?
Itai Gat
Idan Schwartz
Alex Schwing
99
32
0
27 Oct 2021
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning
Pan Lu
Liang Qiu
Jiaqi Chen
Tony Xia
Yizhou Zhao
Wei Zhang
Zhou Yu
Xiaodan Liang
Song-Chun Zhu
AIMat
177
206
0
25 Oct 2021
Alignment Attention by Matching Key and Query Distributions
Shujian Zhang
Xinjie Fan
Huangjie Zheng
Korawat Tanwisuth
Mingyuan Zhou
OOD
124
10
0
25 Oct 2021
Robustness through Data Augmentation Loss Consistency
Tianjian Huang
Shaunak Halbe
Chinnadhurai Sankar
P. Amini
Satwik Kottur
A. Geramifard
Meisam Razaviyayn
Ahmad Beirami
OOD
135
8
0
21 Oct 2021
Single-Modal Entropy based Active Learning for Visual Question Answering
Dong-Jin Kim
Jae-Won Cho
Jinsoo Choi
Yunjae Jung
In So Kweon
63
12
0
21 Oct 2021
Domain Generalization through Audio-Visual Relative Norm Alignment in First Person Action Recognition
M. Planamente
Chiara Plizzari
Emanuele Alberti
Barbara Caputo
EgoV
121
35
0
19 Oct 2021
Towards Language-guided Visual Recognition via Dynamic Convolutions
Gen Luo
Yiyi Zhou
Xiaoshuai Sun
Yongjian Wu
Yue Gao
Rongrong Ji
ObjD
98
19
0
17 Oct 2021
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
Woojeong Jin
Yu Cheng
Yelong Shen
Weizhu Chen
Xiang Ren
VLM, VPVLM, MLLM
128
138
0
16 Oct 2021
The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color
Cory Paik
Stéphane Aroca-Ouellette
Alessandro Roncone
Katharina Kann
71
34
0
15 Oct 2021
Content Preserving Image Translation with Texture Co-occurrence and Spatial Self-Similarity for Texture Debiasing and Domain Adaptation
Hao Li
Dongkyu Won
Wei Lu
Philip Chikontwe
Pengjun Xie
June Hong Ahn
Sang Hyun Park
116
37
0
15 Oct 2021
Semantically Distributed Robust Optimization for Vision-and-Language Inference
Tejas Gokhale
A. Chaudhary
Pratyay Banerjee
Chitta Baral
Yezhou Yang
126
17
0
14 Oct 2021
Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning
Ankit Parag Shah
Shijie Geng
Peng Gao
A. Cherian
Takaaki Hori
Tim K. Marks
Jonathan Le Roux
Chiori Hori
68
24
0
13 Oct 2021
Improving Users' Mental Model with Attention-directed Counterfactual Edits
Kamran Alipour
Arijit Ray
Xiaoyu Lin
Michael Cogswell
J. Schulze
Yi Yao
Giedrius Burachas
OOD
61
9
0
13 Oct 2021
Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking
Dirk Vath
Pascal Tilli
Ngoc Thang Vu
87
4
0
11 Oct 2021
Coarse-to-Fine Reasoning for Visual Question Answering
Binh X. Nguyen
Tuong Khanh Long Do
Huy Tran
Erman Tjiputra
Quang-Dieu Tran
A. Nguyen
NAI
138
40
0
06 Oct 2021
Let there be a clock on the beach: Reducing Object Hallucination in Image Captioning
Ali Furkan Biten
L. G. I. Bigorda
Dimosthenis Karatzas
168
63
0
04 Oct 2021
Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering
Long Chen
Yuhang Zheng
Yulei Niu
Hanwang Zhang
Jun Xiao
AAML, OOD
119
37
0
03 Oct 2021