ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1612.00837
  4. Cited By
Making the V in VQA Matter: Elevating the Role of Image Understanding in
  Visual Question Answering
v1v2v3 (latest)

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
    CoGe
ArXiv (abs)PDFHTML

Papers citing "Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"

50 / 2,037 papers shown
Title
Self-Supervised Multimodal Learning: A Survey
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
125
50
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
130
9
0
30 Mar 2023
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init
  Attention
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang
Jiaming Han
Chris Liu
Peng Gao
Aojun Zhou
Xiangfei Hu
Shilin Yan
Pan Lu
Hongsheng Li
Yu Qiao
MLLM
191
788
0
28 Mar 2023
Exposing and Addressing Cross-Task Inconsistency in Unified
  Vision-Language Models
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
A. Maharana
Amita Kamath
Christopher Clark
Joey Tianyi Zhou
Aniruddha Kembhavi
87
3
0
28 Mar 2023
Borrowing Human Senses: Comment-Aware Self-Training for Social Media
  Multimodal Classification
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Chunpu Xu
Jing Li
VLM
62
5
0
27 Mar 2023
Curriculum Learning for Compositional Visual Reasoning
Curriculum Learning for Compositional Visual Reasoning
Wafa Aissa
Marin Ferecatu
M. Crucianu
LRM
87
3
0
27 Mar 2023
Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained
  Experts
Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained Experts
Kastan Day
D. Christl
Rohan Salvi
Pranav Sriram
ViT
81
1
0
24 Mar 2023
Top-Down Visual Attention from Analysis by Synthesis
Top-Down Visual Attention from Analysis by Synthesis
Baifeng Shi
Trevor Darrell
Xin Eric Wang
88
32
0
23 Mar 2023
Integrating Image Features with Convolutional Sequence-to-sequence
  Network for Multilingual Visual Question Answering
Integrating Image Features with Convolutional Sequence-to-sequence Network for Multilingual Visual Question Answering
T. M. Thai
Son T. Luu
86
0
0
22 Mar 2023
MAGVLT: Masked Generative Vision-and-Language Transformer
MAGVLT: Masked Generative Vision-and-Language Transformer
Sungwoong Kim
DaeJin Jo
Donghoon Lee
Jongmin Kim
VLM
60
12
0
21 Mar 2023
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation
  with Question Answering
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
Yushi Hu
Benlin Liu
Jungo Kasai
Yizhong Wang
Mari Ostendorf
Ranjay Krishna
Noah A. Smith
EGVM
95
239
0
21 Mar 2023
Large AI Models in Health Informatics: Applications, Challenges, and the
  Future
Large AI Models in Health Informatics: Applications, Challenges, and the Future
Jianing Qiu
Lin Li
Jiankai Sun
Jiachuan Peng
Peilun Shi
...
Bo Xiao
Wu Yuan
Ningli Wang
Dong Xu
Benny Lo
AI4MHLM&MA
116
142
0
21 Mar 2023
eP-ALM: Efficient Perceptual Augmentation of Language Models
eP-ALM: Efficient Perceptual Augmentation of Language Models
Mustafa Shukor
Corentin Dancette
Matthieu Cord
MLLMVLM
74
31
0
20 Mar 2023
3D Concept Learning and Reasoning from Multi-View Images
3D Concept Learning and Reasoning from Multi-View Images
Yining Hong
Chun-Tse Lin
Yilun Du
Zhenfang Chen
J. Tenenbaum
Chuang Gan
3DV
94
52
0
20 Mar 2023
SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel
  Storage
SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage
Song Park
Sanghyuk Chun
Byeongho Heo
Wonjae Kim
Sangdoo Yun
VLMViT
97
8
0
20 Mar 2023
FVQA 2.0: Introducing Adversarial Samples into Fact-based Visual
  Question Answering
FVQA 2.0: Introducing Adversarial Samples into Fact-based Visual Question Answering
Weizhe Lin
Zhilin Wang
Bill Byrne
AAML
110
4
0
19 Mar 2023
Divide and Conquer: Answering Questions with Object Factorization and
  Compositional Reasoning
Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning
Shi Chen
Qi Zhao
101
6
0
18 Mar 2023
Data Roaming and Quality Assessment for Composed Image Retrieval
Data Roaming and Quality Assessment for Composed Image Retrieval
Matan Levy
Rami Ben-Ari
N. Darshan
Dani Lischinski
105
28
0
16 Mar 2023
Logical Implications for Visual Question Answering Consistency
Logical Implications for Visual Question Answering Consistency
Sergio Tascon-Morales
Pablo Márquez-Neila
Raphael Sznitman
81
9
0
16 Mar 2023
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Implicit and Explicit Commonsense for Multi-sentence Video Captioning
Shih-Han Chou
James J. Little
Leonid Sigal
74
2
0
14 Mar 2023
Vision-Language Models as Success Detectors
Vision-Language Models as Success Detectors
Yuqing Du
Ksenia Konyushkova
Misha Denil
A. Raju
Jessica Landon
Felix Hill
Nando de Freitas
Serkan Cabi
MLLMLRM
130
86
0
13 Mar 2023
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of
  Synthetic and Compositional Images
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
Nitzan Bitton-Guetta
Yonatan Bitton
Jack Hessel
Ludwig Schmidt
Yuval Elovici
Gabriel Stanovsky
Roy Schwartz
VLM
228
70
0
13 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLMMoE
81
68
0
13 Mar 2023
ViM: Vision Middleware for Unified Downstream Transferring
ViM: Vision Middleware for Unified Downstream Transferring
Yutong Feng
Biao Gong
Jianwen Jiang
Yiliang Lv
Yujun Shen
Deli Zhao
Jingren Zhou
107
1
0
13 Mar 2023
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched
  Visual Descriptions
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Deyao Zhu
Jun Chen
Kilichbek Haydarov
Xiaoqian Shen
Wenxuan Zhang
Mohamed Elhoseiny
MLLM
100
106
0
12 Mar 2023
Understanding and Constructing Latent Modality Structures in Multi-modal
  Representation Learning
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning
Qian Jiang
Changyou Chen
Han Zhao
Liqun Chen
Q. Ping
S. D. Tran
Yi Xu
Belinda Zeng
Trishul Chilimbi
101
43
0
10 Mar 2023
Tag2Text: Guiding Vision-Language Model via Image Tagging
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang
Youcai Zhang
Jinyu Ma
Weiwei Tian
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Lei Zhang
CLIPMLLMVLM3DV
155
77
0
10 Mar 2023
Refined Vision-Language Modeling for Fine-grained Multi-modal
  Pre-training
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
Lisai Zhang
Qingcai Chen
Zhijian Chen
Yunpeng Han
Zhonghua Li
Bo Zhao
VLM
61
1
0
09 Mar 2023
Toward Unsupervised Realistic Visual Question Answering
Toward Unsupervised Realistic Visual Question Answering
Yuwei Zhang
Chih-Hui Ho
Nuno Vasconcelos
CoGe
89
2
0
09 Mar 2023
Graph Neural Networks in Vision-Language Image Understanding: A Survey
Graph Neural Networks in Vision-Language Image Understanding: A Survey
Henry Senior
Greg Slabaugh
Shanxin Yuan
Luca Rossi
GNN
97
21
0
07 Mar 2023
PaLM-E: An Embodied Multimodal Language Model
PaLM-E: An Embodied Multimodal Language Model
Danny Driess
F. Xia
Mehdi S. M. Sajjadi
Corey Lynch
Aakanksha Chowdhery
...
Marc Toussaint
Klaus Greff
Andy Zeng
Igor Mordatch
Peter R. Florence
LM&Ro
166
1,679
0
06 Mar 2023
VTQA: Visual Text Question Answering via Entity Alignment and
  Cross-Media Reasoning
VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning
Kan Chen
Xiangqian Wu
CoGe
59
9
0
05 Mar 2023
Knowledge-Based Counterfactual Queries for Visual Question Answering
Knowledge-Based Counterfactual Queries for Visual Question Answering
Theodoti Stoikou
Maria Lymperaiou
Giorgos Stamou
AAML
80
1
0
05 Mar 2023
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering
Zhou Yu
Xuecheng Ouyang
Zhenwei Shao
Mei Wang
Jun Yu
MLLM
198
11
0
03 Mar 2023
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource
  Visual Question Answering
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Jingjing Jiang
Nanning Zheng
MoE
124
6
0
02 Mar 2023
VQA with Cascade of Self- and Co-Attention Blocks
VQA with Cascade of Self- and Co-Attention Blocks
Aakansha Mishra
Ashish Anand
Prithwijit Guha
44
1
0
28 Feb 2023
Language Is Not All You Need: Aligning Perception with Language Models
Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang
Li Dong
Wenhui Wang
Y. Hao
Saksham Singhal
...
Johan Bjorck
Vishrav Chaudhary
Subhojit Som
Xia Song
Furu Wei
VLMLRMMLLM
148
567
0
27 Feb 2023
Medical visual question answering using joint self-supervised learning
Medical visual question answering using joint self-supervised learning
Yuan Zhou
Jing Mei
Yiqin Yu
Tanveer Syeda-Mahmood
MedIm
52
1
0
25 Feb 2023
Quantifying & Modeling Multimodal Interactions: An Information
  Decomposition Framework
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Paul Pu Liang
Yun Cheng
Xiang Fan
Chun Kai Ling
Suzanne Nie
...
Nicholas B. Allen
Randy P. Auerbach
Faisal Mahmood
Ruslan Salakhutdinov
Louis-Philippe Morency
116
37
0
23 Feb 2023
Learning Visual Representations via Language-Guided Sampling
Learning Visual Representations via Language-Guided Sampling
Mohamed El Banani
Karan Desai
Justin Johnson
SSLVLM
124
28
0
23 Feb 2023
EVJVQA Challenge: Multilingual Visual Question Answering
EVJVQA Challenge: Multilingual Visual Question Answering
Ngan Luu-Thuy Nguyen
Nghia Hieu Nguyen
Duong T.D. Vo
K. Tran
Kiet Van Nguyen
95
7
0
23 Feb 2023
Can Pre-trained Vision and Language Models Answer Visual
  Information-Seeking Questions?
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?
Yang Chen
Hexiang Hu
Yi Luan
Haitian Sun
Soravit Changpinyo
Alan Ritter
Ming-Wei Chang
152
94
0
23 Feb 2023
Open-domain Visual Entity Recognition: Towards Recognizing Millions of
  Wikipedia Entities
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities
Hexiang Hu
Yi Luan
Yang Chen
Urvashi Khandelwal
Mandar Joshi
Kenton Lee
Kristina Toutanova
Ming-Wei Chang
VLM
136
61
0
22 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CEVLM
192
216
0
20 Feb 2023
Interpretable Medical Image Visual Question Answering via Multi-Modal
  Relationship Graph Learning
Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning
Xinyue Hu
Lin Gu
Kazuma Kobayashi
Qi A. An
Qingyu Chen
Zhiyong Lu
Chang Su
Tatsuya Harada
Yingying Zhu
GNN
71
10
0
19 Feb 2023
Few-shot Multimodal Multitask Multilingual Learning
Few-shot Multimodal Multitask Multilingual Learning
Aman Chadha
Vinija Jain
125
0
0
19 Feb 2023
Bridge Damage Cause Estimation Using Multiple Images Based on Visual
  Question Answering
Bridge Damage Cause Estimation Using Multiple Images Based on Visual Question Answering
T. Yamane
Pang-jo Chun
Jiachen Dang
Takayuki Okatani
32
0
0
18 Feb 2023
Multimodal Federated Learning via Contrastive Representation Ensemble
Multimodal Federated Learning via Contrastive Representation Ensemble
Qiying Yu
Yang Liu
Yimu Wang
Ke Xu
Jingjing Liu
86
90
0
17 Feb 2023
Multi-modal Machine Learning in Engineering Design: A Review and Future
  Directions
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
146
47
0
14 Feb 2023
UniAdapter: Unified Parameter-Efficient Transfer Learning for
  Cross-modal Modeling
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Haoyu Lu
Yuqi Huo
Guoxing Yang
Zhiwu Lu
Wei Zhan
Masayoshi Tomizuka
Mingyu Ding
94
36
0
13 Feb 2023
Previous
123...232425...394041
Next