arXiv: 1505.00468
VQA: Visual Question Answering

3 May 2015
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
    CoGe

Papers citing "VQA: Visual Question Answering"

50 / 2,957 papers shown
Title
Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision
Xiaoshi Wu
Hadar Averbuch-Elor
J. Sun
Noah Snavely
84
20
0
12 Aug 2021
A Better Loss for Visual-Textual Grounding
Davide Rigoni
Luciano Serafini
A. Sperduti
ObjD
62
3
0
11 Aug 2021
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Alessandro Suglia
Qiaozi Gao
Jesse Thomason
Govind Thattai
Gaurav Sukhatme
LM&Ro
133
78
0
10 Aug 2021
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models
Zheyuan Liu
Cristian Rodriguez-Opazo
Damien Teney
Stephen Gould
VLM
84
207
0
09 Aug 2021
OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning
Sheng Liu
Kevin Qinghong Lin
Lijuan Wang
Junsong Yuan
Zicheng Liu
VLM
37
3
0
08 Aug 2021
Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
Bryan Wang
Gang Li
Xin Zhou
Zhourong Chen
Tovi Grossman
Yang Li
212
160
0
07 Aug 2021
Interpretable Visual Understanding with Cognitive Attention Network
Xuejiao Tang
Wenbin Zhang
Yi Yu
Kea Turner
Hanyu Wang
Mengyu Wang
Eirini Ntoutsi
136
12
0
06 Aug 2021
Communicative Learning with Natural Gestures for Embodied Navigation Agents with Human-in-the-Scene
Qi Wu
Cheng-Ju Wu
Yixin Zhu
Jungseock Joo
99
14
0
05 Aug 2021
Hybrid Reasoning Network for Video-based Commonsense Captioning
Weijiang Yu
Jian Liang
Lei Ji
Lu Li
Yuejian Fang
Nong Xiao
Nan Duan
67
10
0
05 Aug 2021
Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service
Zhongwei Xie
Ling Liu
Yanzhao Wu
Lin Li
Luo Zhong
52
26
0
02 Aug 2021
Chest ImaGenome Dataset for Clinical Reasoning
Joy T. Wu
Nkechinyere N. Agu
Ismini Lourentzou
Arjun Sharma
J. Paguio
...
William Mitchell
Satyananda Kashyap
Andrea Giovannini
Leo Anthony Celi
Mehdi Moradi
58
67
0
31 Jul 2021
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension
Anna Rogers
Matt Gardner
Isabelle Augenstein
137
170
0
27 Jul 2021
Greedy Gradient Ensemble for Robust Visual Question Answering
Xinzhe Han
Shuhui Wang
Chi Su
Qingming Huang
Q. Tian
65
78
0
27 Jul 2021
Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference
Juncheng Li
Siliang Tang
Linchao Zhu
Haochen Shi
Xuanwen Huang
Leilei Gan
Yi Yang
Yueting Zhuang
112
28
0
26 Jul 2021
X-GGM: Graph Generative Modeling for Out-of-Distribution Generalization in Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Yifan Liu
Zhixiong Nan
N. Zheng
OOD
89
19
0
24 Jul 2021
Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation
Bingqian Lin
Yi Zhu
Yanxin Long
Xiaodan Liang
QiXiang Ye
Liang Lin
AAML
89
16
0
23 Jul 2021
Neural Variational Learning for Grounded Language Acquisition
Nisha Pillai
Cynthia Matuszek
Francis Ferraro
VLM SSL GAN DRL
113
2
0
20 Jul 2021
Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning
Kaylee Burns
Christopher D. Manning
Li Fei-Fei
53
0
0
20 Jul 2021
Separating Skills and Concepts for Novel Visual Question Answering
Spencer Whitehead
Hui Wu
Heng Ji
Rogerio Feris
Kate Saenko
CoGe
95
34
0
19 Jul 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Steven C. H. Hoi
FaML
341
1,991
0
16 Jul 2021
MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Zetian Wu
Yun Cheng
...
Peter Wu
Michelle A. Lee
Yuke Zhu
Ruslan Salakhutdinov
Louis-Philippe Morency
VLM
111
172
0
15 Jul 2021
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen
Liunian Harold Li
Hao Tan
Mohit Bansal
Anna Rohrbach
Kai-Wei Chang
Z. Yao
Kurt Keutzer
CLIP VLM MLLM
276
412
0
13 Jul 2021
Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering
Rajat Koner
Hang Li
Marcel Hildebrandt
Deepan Das
Volker Tresp
Stephan Günnemann
63
34
0
13 Jul 2021
Zero-shot Visual Question Answering using Knowledge Graph
Zhuo Chen
Jiaoyan Chen
Yuxia Geng
Jeff Z. Pan
Zonggang Yuan
Huajun Chen
87
70
0
12 Jul 2021
Split, embed and merge: An accurate table structure recognizer
Zhenrong Zhang
Jianshu Zhang
Jun Du
LMTD
184
62
0
12 Jul 2021
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering
Jianyu Wang
Bingkun Bao
Changsheng Xu
66
75
0
10 Jul 2021
Using Depth for Improving Referring Expression Comprehension in Real-World Environments
Fethiye Irmak Dogan
Iolanda Leite
52
5
0
09 Jul 2021
Multi-Modality Task Cascade for 3D Object Detection
Jinhyung D. Park
Xinshuo Weng
Yunze Man
Kris Kitani
3DPC
48
7
0
08 Jul 2021
MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering
Haiwei Pan
Shuning He
Kejia Zhang
Bo Qu
Chunling Chen
Kun Shi
56
11
0
07 Jul 2021
Deep Learning for Embodied Vision Navigation: A Survey
Fengda Zhu
Yi Zhu
Vincent CS Lee
Xiaodan Liang
Xiaojun Chang
EgoV LM&Ro
101
0
0
07 Jul 2021
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
Zineng Tang
Jaemin Cho
Hao Tan
Mohit Bansal
VLM
59
29
0
06 Jul 2021
PhotoChat: A Human-Human Dialogue Dataset with Photo Sharing Behavior for Joint Image-Text Modeling
Xiaoxue Zang
Lijuan Liu
Maria Wang
Yang Song
Hao Zhang
Jindong Chen
VLM
103
60
0
06 Jul 2021
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering
Siddharth Karamcheti
Ranjay Krishna
Li Fei-Fei
Christopher D. Manning
96
92
0
06 Jul 2021
Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory
Xuejiao Tang
Xin Huang
Wenbin Zhang
T. Child
Qiong Hu
Zhen Liu
Ji Zhang
LRM
81
19
0
04 Jul 2021
Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots
Shintaro Ishikawa
K. Sugiura
67
11
0
02 Jul 2021
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Jing Liu
Xinxin Zhu
Fei Liu
Longteng Guo
Zijia Zhao
...
Weining Wang
Hanqing Lu
Shiyu Zhou
Jiajun Zhang
Jinqiao Wang
87
38
0
01 Jul 2021
Weakly Supervised Temporal Adjacent Network for Language Grounding
Yuechen Wang
Jiajun Deng
Wen-gang Zhou
Houqiang Li
105
67
0
30 Jun 2021
Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue
Shoya Matsumori
Kosuke Shingyouchi
Yukikoko Abe
Yosuke Fukuchi
K. Sugiura
M. Imai
99
16
0
29 Jun 2021
Adventurer's Treasure Hunt: A Transparent System for Visually Grounded Compositional Visual Question Answering based on Scene Graphs
Daniel Reich
F. Putze
Tanja Schultz
61
2
0
28 Jun 2021
Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference
Riko Suzuki
Hitomi Yanaka
K. Mineshima
D. Bekki
VGen MLLM
53
1
0
27 Jun 2021
Saying the Unseen: Video Descriptions via Dialog Agents
Ye Zhu
Yu Wu
Yi Yang
Yan Yan
71
6
0
26 Jun 2021
Core Challenges in Embodied Vision-Language Planning
Jonathan M Francis
Nariaki Kitamura
Felix Labelle
Xiaopeng Lu
Ingrid Navarro
Jean Oh
LM&Ro
144
48
0
26 Jun 2021
A Picture May Be Worth a Hundred Words for Visual Question Answering
Yusuke Hirota
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Ittetsu Taniguchi
Takao Onoye
ViT
35
4
0
25 Jun 2021
iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability
Andrew Wang
Aman Chadha
CML
34
5
0
25 Jun 2021
A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021
Keda Lu
Bo Fang
Kuan-Yu Chen
ViT
45
2
0
24 Jun 2021
Multimodal Emergent Fake News Detection via Meta Neural Process Networks
Yaqing Wang
Fenglong Ma
Haoyu Wang
Kishlay Jha
Jing Gao
131
61
0
22 Jun 2021
Visual Probing: Cognitive Framework for Explaining Self-Supervised Image Representations
Witold Oleszkiewicz
Dominika Basaj
Igor Sieradzki
Michal Górszczak
Barbara Rychalska
K. Lewandowska
Tomasz Trzciński
Bartosz Zieliński
SSL
75
3
0
21 Jun 2021
Interventional Video Grounding with Dual Contrastive Learning
Guoshun Nan
Rui Qiao
Yao Xiao
Jun Liu
Sicong Leng
H. Zhang
Wei Lu
98
145
0
21 Jun 2021
Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering
Ahjeong Seo
Gi-Cheon Kang
J. Park
Byoung-Tak Zhang
82
54
0
19 Jun 2021
GEM: A General Evaluation Benchmark for Multimodal Tasks
Lin Su
Nan Duan
Edward Cui
Lei Ji
Chenfei Wu
Huaishao Luo
Yongfei Liu
Ming Zhong
Taroon Bharti
Arun Sacheti
VLM
112
19
0
18 Jun 2021