Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1505.00468
Cited By
v1
v2
v3
v4
v5
v6
v7 (latest)
VQA: Visual Question Answering
3 May 2015
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"VQA: Visual Question Answering"
50 / 2,957 papers shown
Title
VQA Therapy: Exploring Answer Differences by Visually Grounding Answers
Chongyan Chen
Samreen Anjum
Danna Gurari
96
9
0
21 Aug 2023
On the Adversarial Robustness of Multi-Modal Foundation Models
Christian Schlarmann
Matthias Hein
AAML
180
107
0
21 Aug 2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
111
12
0
18 Aug 2023
Vision Relation Transformer for Unbiased Scene Graph Generation
Gopika Sudhakaran
Devendra Singh Dhami
Kristian Kersting
Stefan Roth
ViT
117
18
0
18 Aug 2023
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
K. Mangalam
Raiymbek Akshulakov
Jitendra Malik
169
310
0
17 Aug 2023
Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes
Zehan Wang
Haifeng Huang
Yang Zhao
Ziang Zhang
Zhou Zhao
118
73
0
17 Aug 2023
Learning the meanings of function words from grounded language using a visual question answering model
Eva Portelance
Michael C. Frank
Dan Jurafsky
NAI
86
7
0
16 Aug 2023
Diagnosing Human-object Interaction Detectors
Fangrui Zhu
Yiming Xie
Weidi Xie
Huaizu Jiang
77
8
0
16 Aug 2023
Boosting Multi-modal Model Performance with Adaptive Gradient Modulation
Hong Li
Xingyu Li
Pengbo Hu
Yinuo Lei
Chunxiao Li
Yi Zhou
86
27
0
15 Aug 2023
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
Hongguang Zhu
Yunchao Wei
Xiaodan Liang
Chunjie Zhang
Yao-Min Zhao
VLM
72
30
0
14 Aug 2023
ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation
Chaohui Yu
Qiang-feng Zhou
Zhibin Wang
Fan Wang
VLM
30
1
0
14 Aug 2023
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use
Yonatan Bitton
Hritik Bansal
Jack Hessel
Rulin Shao
Wanrong Zhu
Anas Awadalla
Josh Gardner
Rohan Taori
L. Schimdt
VLM
129
82
0
12 Aug 2023
Foundation Model is Efficient Multimodal Multitask Model Selector
Fanqing Meng
Wenqi Shao
Zhanglin Peng
Chong Jiang
Kaipeng Zhang
Yu Qiao
Ping Luo
67
17
0
11 Aug 2023
Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Guangyao Li
Wenxuan Hou
Di Hu
76
32
0
10 Aug 2023
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
Yi Bin
Haoxuan Li
Yahui Xu
Xing Xu
Yang Yang
Heng Tao Shen
VOS
69
20
0
08 Aug 2023
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
Juncheng Li
Kaihang Pan
Zhiqi Ge
Minghe Gao
Wei Ji
Wenqiao Zhang
Tat-Seng Chua
Siliang Tang
Hanwang Zhang
Yueting Zhuang
MLLM
121
73
0
08 Aug 2023
Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation
Yu Min
Aming Wu
Cheng Deng
91
7
0
07 Aug 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
169
720
0
04 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
133
88
0
03 Aug 2023
Making the V in Text-VQA Matter
Shamanthak Hegde
Soumya Jahagirdar
Shankar Gangisetty
CoGe
87
4
0
01 Aug 2023
FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration
Zhiji Huang
Sihao Lin
Guiyu Liu
Mukun Luo
Chao Ye
Hang Xu
Xiaojun Chang
Xiaodan Liang
87
7
0
31 Jul 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
58
2
0
31 Jul 2023
Triple Correlations-Guided Label Supplementation for Unbiased Video Scene Graph Generation
Wenqing Wang
Kaifeng Gao
Yawei Luo
Tao Jiang
Fei Gao
Jian Shao
Jianwen Sun
Jun Xiao
106
3
0
30 Jul 2023
Synthesizing Event-centric Knowledge Graphs of Daily Activities Using Virtual Space
S. Egami
Takanori Ugai
Mikiko Oono
K. Kitamura
Ken Fukuda
51
11
0
30 Jul 2023
Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering
N. Naik
Christopher Potts
Elisa Kreiss
87
4
0
28 Jul 2023
Panoptic Scene Graph Generation with Semantics-Prototype Learning
Li Li
Wei Ji
Yiming Wu
Meng Li
Youxuan Qin
Lina Wei
Roger Zimmermann
98
38
0
28 Jul 2023
BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering
Khiem Vinh Tran
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
ViT
71
2
0
28 Jul 2023
PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data
Zheng Zhang
Zheng Ning
Chenliang Xu
Yapeng Tian
Toby Jia-Jun Li
98
7
0
27 Jul 2023
MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained Semantic Classes and Hard Negative Entities
Yongqian Li
Tingwei Lu
Hai-Tao Zheng
Tianyu Yu
Shulin Huang
Haitao Zheng
Rui Zhang
Jun Yuan
95
11
0
27 Jul 2023
LOIS: Looking Out of Instance Semantics for Visual Question Answering
Siyu Zhang
Ye Chen
Yaoru Sun
Fang Wang
Haibo Shi
Haoran Wang
59
5
0
26 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming-Hsuan Yang
Fahad Shahbaz Khan
VLM
146
128
0
25 Jul 2023
Enhancing image captioning with depth information using a Transformer-based framework
Aya Mahmoud Ahmed
Mohamed Yousef
K. Hussain
Yousef B. Mahdy
ViT
69
4
0
24 Jul 2023
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
Jingxuan Wei
Cheng Tan
Zhangyang Gao
Linzhuang Sun
Siyuan Li
Bihui Yu
R. Guo
Stan Z. Li
LRM
120
12
0
24 Jul 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
126
23
0
21 Jul 2023
Actor-agnostic Multi-label Action Recognition with Multi-modal Query
Anindya Mondal
Sauradip Nag
J. Prada
Xiatian Zhu
Anjan Dutta
67
11
0
20 Jul 2023
Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection
Guangzhi Wang
Yangyang Guo
Mohan S. Kankanhalli
71
0
0
19 Jul 2023
Explaining Autonomous Driving Actions with Visual Question Answering
Shahin Atakishiyev
Mohammad Salameh
H. Babiker
Randy Goebel
75
17
0
19 Jul 2023
A reinforcement learning approach for VQA validation: an application to diabetic macular edema grading
Tatiana Fountoukidou
Raphael Sznitman
OOD
46
5
0
19 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Chaoyang Zhu
Long Chen
ObjD
VLM
146
40
0
18 Jul 2023
BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization
Chaoya Jiang
Haiyang Xu
Wei Ye
Qinghao Ye
Chenliang Li
Mingshi Yan
Bin Bi
Shikun Zhang
Fei Huang
Songfang Huang
VLM
63
9
0
17 Jul 2023
PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen
Kiet Van Nguyen
47
2
0
17 Jul 2023
Sim2Plan: Robot Motion Planning via Message Passing between Simulation and Reality
Yizhou Zhao
Yuanhong Zeng
Qiang Long
Ying Nian Wu
Song-Chun Zhu
73
0
0
15 Jul 2023
DRAGON: A Dialogue-Based Robot for Assistive Navigation with Visual Language Grounding
Shuijing Liu
Aamir Hasan
Kaiwen Hong
Runxuan Wang
Peixin Chang
Z. Mizrachi
Justin Lin
D. L. McPherson
W. Rogers
Katherine Driggs-Campbell
LM&Ro
123
16
0
13 Jul 2023
Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting
Chantal Pellegrini
Matthias Keicher
Ege Özsoy
Nassir Navab
92
17
0
11 Jul 2023
SVIT: Scaling up Visual Instruction Tuning
Bo Zhao
Boya Wu
Muyang He
Tiejun Huang
MLLM
94
128
0
09 Jul 2023
Reading Between the Lanes: Text VideoQA on the Road
George Tom
Minesh Mathew
Sergi Garcia
Dimosthenis Karatzas
C. V. Jawahar
88
8
0
08 Jul 2023
MultiQG-TI: Towards Question Generation from Multi-modal Sources
Zichao Wang
Richard Baraniuk
46
5
0
07 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
53
5
0
06 Jul 2023
Read, Look or Listen? What's Needed for Solving a Multimodal Dataset
Netta Madvil
Yonatan Bitton
Roy Schwartz
64
3
0
06 Jul 2023
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng
Hanbo Zhang
Jiani Zheng
Jiangnan Xia
Guoqiang Wei
Yang Wei
Yuchen Zhang
Tao Kong
MLLM
109
79
0
05 Jul 2023
Previous
1
2
3
...
18
19
20
...
58
59
60
Next