1505.00468
VQA: Visual Question Answering
3 May 2015
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
Papers citing "VQA: Visual Question Answering"
50 / 2,957 papers shown
Title
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
161
7
0
18 Apr 2024
MEEL: Multi-Modal Event Evolution Learning
Zhengwei Tao
Zhi Jin
Junqiang Huang
Xiancai Chen
Xiaoying Bai
Haiyan Zhao
Yifan Zhang
Chongyang Tao
75
1
0
16 Apr 2024
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
Songtao Jiang
Tuo Zheng
Yan Zhang
Yeying Jin
Li Yuan
Zuozhu Liu
MoE
136
23
0
16 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
172
5
0
16 Apr 2024
Epigraphics: Message-Driven Infographics Authoring
Tongyu Zhou
Jeff Huang
G. Chan
85
10
0
15 Apr 2024
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception
Yipo Huang
Xiangfei Sheng
Zhichao Yang
Quan Yuan
Zhichao Duan
Pengfei Chen
Leida Li
Weisi Lin
Guangming Shi
112
25
0
15 Apr 2024
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark
Zhaokun Zhou
Qiulin Wang
Bin Lin
Yiwei Su
Ruoxin Chen
Xin Tao
Amin Zheng
Li-xin Yuan
Pengfei Wan
Di Zhang
62
10
0
15 Apr 2024
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
Övgü Özdemir
Erdem Akagündüz
104
11
0
12 Apr 2024
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Haotian Zhang
Haoxuan You
Philipp Dufter
Bowen Zhang
Chen Chen
...
Tsu-Jui Fu
William Y. Wang
Shih-Fu Chang
Zhe Gan
Yinfei Yang
ObjD
MLLM
155
51
0
11 Apr 2024
AUG: A New Dataset and An Efficient Model for Aerial Image Urban Scene Graph Generation
Yansheng Li
Kun Li
Yongjun Zhang
Linlin Wang
Dingwen Zhang
130
3
0
11 Apr 2024
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Bin Wang
...
Xingcheng Zhang
Jifeng Dai
Yuxin Qiao
Dahua Lin
Jiaqi Wang
VLM
MLLM
116
127
0
09 Apr 2024
GUIDE: Graphical User Interface Data for Execution
Rajat Chawla
Adarsh Jha
Muskaan Kumar
NS Mukunda
Ishaan Bhola
LLMAG
74
3
0
09 Apr 2024
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
Junpeng Liu
Yifan Song
Bill Yuchen Lin
Wai Lam
Graham Neubig
Yuanzhi Li
Xiang Yue
VLM
132
49
0
09 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen
Kumar Ashutosh
Rohit Girdhar
David Harwath
Kristen Grauman
EgoV
SSL
86
7
0
08 Apr 2024
FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback
Liqiang Jing
Xinya Du
185
18
0
07 Apr 2024
Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models
Songtao Jiang
Yan Zhang
Chenyi Zhou
Yeying Jin
Yang Feng
Jian Wu
Zuozhu Liu
LRM
VLM
100
6
0
06 Apr 2024
Which Experimental Design is Better Suited for VQA Tasks? Eye Tracking Study on Cognitive Load, Performance, and Gaze Allocations
S. Vriend
Sandeep Vidyapu
Amer Rama
Kun-Ting Chen
Daniel Weiskopf
28
3
0
05 Apr 2024
BuDDIE: A Business Document Dataset for Multi-task Information Extraction
Ran Zmigrod
Dongsheng Wang
Mathieu Sibue
Yulong Pei
Petr Babkin
...
Antony Papadimitriou
William Watson
Zhiqiang Ma
Armineh Nourbakhsh
Sameena Shah
64
5
0
05 Apr 2024
Continual Learning of Numerous Tasks from Long-tail Distributions
Liwei Kang
Wee Sun Lee
93
0
0
03 Apr 2024
Bi-LORA: A Vision-Language Approach for Synthetic Image Detection
Mamadou Keita
W. Hamidouche
Hessen Bougueffa Eutamene
Abdenour Hadid
Abdelmalik Taleb-Ahmed
110
9
0
02 Apr 2024
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
Paritosh Parmar
Eric Peh
Ruirui Chen
Ting En Lam
Yuhan Chen
Elston Tan
Basura Fernando
CML
93
7
0
01 Apr 2024
Dialogue with Robots: Proposals for Broadening Participation and Research in the SLIVAR Community
Casey Kennington
Malihe Alikhani
Heather Pon-Barry
Katherine Atwell
Yonatan Bisk
...
Jivko Sinapov
Angela Stewart
Matthew Stone
Stefanie Tellex
Tom Williams
104
0
0
01 Apr 2024
Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs
Jialou Wang
Manli Zhu
Yulei Li
Honglei Li
Long Yang
Wai Lok Woo
37
1
0
01 Apr 2024
Continual Learning for Smart City: A Survey
Li Yang
Zhipeng Luo
Shi-sheng Zhang
Fei Teng
Tian-Jie Li
HAI
98
9
0
01 Apr 2024
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
Jesse Atuhurra
Iqra Ali
Tatsuya Hiraoka
Hidetaka Kamigaito
Tomoya Iwakura
Taro Watanabe
108
1
0
29 Mar 2024
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models
Eri Onami
Shuhei Kurita
Taiki Miyanishi
Taro Watanabe
72
3
0
28 Mar 2024
MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models
Yanting Wang
Hongye Fu
Wei Zou
Jinyuan Jia
AAML
49
2
0
28 Mar 2024
Envisioning MedCLIP: A Deep Dive into Explainability for Medical Vision-Language Models
Anees Ur Rehman Hashmi
Dwarikanath Mahapatra
Mohammad Yaqub
VLM
MedIm
43
2
0
27 Mar 2024
Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective
Meiqi Chen
Yixin Cao
Yan Zhang
Chaochao Lu
107
16
0
27 Mar 2024
EndToEndML: An Open-Source End-to-End Pipeline for Machine Learning Applications
N. Pillai
A. Das
M. Ayoola
Ganga Gireesan
B. Nanduri
Mahalingam Ramkumar
SyDa
56
2
0
27 Mar 2024
Intrinsic Subgraph Generation for Interpretable Graph based Visual Question Answering
Pascal Tilli
Ngoc Thang Vu
70
1
0
26 Mar 2024
A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions
Shun Inadumi
Seiya Kawano
Akishige Yuguchi
Yasutomo Kawanishi
Koichiro Yoshino
47
1
0
26 Mar 2024
PropTest: Automatic Property Testing for Improved Visual Programming
Jaywon Koo
Ziyan Yang
Paola Cascante-Bonilla
Baishakhi Ray
Vicente Ordonez
LRM
63
2
0
25 Mar 2024
Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
Jimyeong Kim
Jungwon Park
Wonjong Rhee
DiffM
91
5
0
22 Mar 2024
A Multimodal Approach for Cross-Domain Image Retrieval
Lucas Iijima
Tania Stathaki
66
1
0
22 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith C.H. Ngai
116
6
0
21 Mar 2024
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
Zeyu Han
Chao Gao
Jinyang Liu
Jeff Zhang
Sai Qian Zhang
305
403
0
21 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
171
9
0
21 Mar 2024
VL-Mamba: Exploring State Space Models for Multimodal Learning
Yanyuan Qiao
Zheng Yu
Longteng Guo
Sihan Chen
Zijia Zhao
Mingzhen Sun
Qi Wu
Jing Liu
Mamba
114
72
0
20 Mar 2024
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
Yew Ken Chia
Vernon Toh Yan Han
Deepanway Ghosal
Lidong Bing
Soujanya Poria
LRM
ReLM
86
23
0
20 Mar 2024
From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
Kung-Hsiang Huang
Hou Pong Chan
Yi R. Fung
Haoyi Qiu
Mingyang Zhou
Shafiq Joty
Shih-Fu Chang
Chenhui Xu
AI4TS
123
32
0
18 Mar 2024
Agent3D-Zero: An Agent for Zero-shot 3D Understanding
Sha Zhang
Di Huang
Jiajun Deng
Shixiang Tang
Wanli Ouyang
Tong He
Yanyong Zhang
VGen
66
18
0
18 Mar 2024
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu
Yuan Yao
Zonghao Guo
Junbo Cui
Zanlin Ni
Chunjiang Ge
Tat-Seng Chua
Zhiyuan Liu
Maosong Sun
Gao Huang
VLM
MLLM
128
121
0
18 Mar 2024
Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation
Ming Xu
Zilong Xie
100
2
0
18 Mar 2024
Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches
Igor Sterner
Weizhe Lin
Jinghong Chen
Bill Byrne
59
4
0
17 Mar 2024
Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts
Michael Stephen Saxon
Yiran Luo
Sharon Levy
Chitta Baral
Yezhou Yang
William Y. Wang
EGVM
92
5
0
17 Mar 2024
EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
Rocktim Jyoti Das
Simeon Emilov Hristov
Haonan Li
Dimitar Iliyanov Dimitrov
Ivan Koychev
Preslav Nakov
CoGe
ELM
116
17
0
15 Mar 2024
Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models
Tian Meng
Yang Tao
Ruilin Lyu
Wuliang Yin
VLM
86
1
0
15 Mar 2024
Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao
Jian Jia
Longteng Guo
Qunbo Wang
Te Yang
...
Yanhua Cheng
Bo Wang
Quan Chen
Han Li
Jing Liu
79
1
0
15 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Ming Tang
Jinqiao Wang
ObjD
98
14
0
14 Mar 2024