Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1505.00468
Cited By
v1
v2
v3
v4
v5
v6
v7 (latest)
VQA: Visual Question Answering
3 May 2015
Aishwarya Agrawal
Jiasen Lu
Stanislaw Antol
Margaret Mitchell
C. L. Zitnick
Dhruv Batra
Devi Parikh
CoGe
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"VQA: Visual Question Answering"
50 / 2,957 papers shown
Title
GraphicBench: A Planning Benchmark for Graphic Design with Language Agents
Dayeon Ki
Dinesh Manocha
Marine Carpuat
Gang Wu
Puneet Mathur
Viswanathan Swaminathan
LLMAG
LM&Ro
85
0
0
15 Apr 2025
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding
Zeyi Huang
Haohan Wang
LRM
58
0
0
14 Apr 2025
MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework
Zihan Ling
Zhiyao Guo
Yixuan Huang
Yi An
Shuai Xiao
Jinsong Lan
Xiaoyong Zhu
Bo Zheng
RALM
VLM
179
0
0
14 Apr 2025
Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks
Mohammad Saleha
Azadeh Tabatabaeib
148
0
0
14 Apr 2025
BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning
Shengao Wang
Arjun Chandra
Aoming Liu
Venkatesh Saligrama
Boqing Gong
MLLM
VLM
101
0
0
13 Apr 2025
Evolved Hierarchical Masking for Self-Supervised Learning
Zhanzhou Feng
Shiliang Zhang
141
0
0
12 Apr 2025
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering
Qi Zhi Lim
C. Lee
K. Lim
Kalaiarasi Sonai Muthu Anbananthen
75
0
0
11 Apr 2025
Impact of Language Guidance: A Reproducibility Study
Cherish Puniani
Advika Sinha
Shree Singhi
Aayan Yadav
VLM
206
0
0
10 Apr 2025
Data Metabolism: An Efficient Data Design Schema For Vision Language Model
Jingyuan Zhang
Hongzhi Zhang
Zhou Haonan
Chenxi Sun
Xingguang Ji
Jiakang Wang
Fanheng Kong
Yang Liu
Qi Wang
Fuzheng Zhang
VLM
145
2
0
10 Apr 2025
Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models
Xingguang Ji
Jiakang Wang
Hongzhi Zhang
Jingyuan Zhang
Haonan Zhou
Chenxi Sun
Yang Liu
Qi Wang
Fuzheng Zhang
MLLM
VLM
134
0
0
10 Apr 2025
Resource-efficient Inference with Foundation Model Programs
Lunyiu Nie
Zhimin Ding
Kevin Yu
Marco Cheung
C. Jermaine
S. Chaudhuri
73
0
0
09 Apr 2025
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
Xiangxi Zheng
Linjie Li
Zhiyong Yang
Ping Yu
Alex Jinpeng Wang
Rui Yan
Yuan Yao
Lijuan Wang
LRM
67
1
0
08 Apr 2025
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
...
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf
VLM
99
16
0
07 Apr 2025
Locations of Characters in Narratives: Andersen and Persuasion Datasets
Batuhan Ozyurt
Roya Arkhmammadova
Deniz Yuret
75
2
0
04 Apr 2025
RBT4DNN: Requirements-based Testing of Neural Networks
Nusrat Jahan Mozumder
Felipe Toledo
Swaroopa Dola
Matthew B. Dwyer
AAML
142
1
0
03 Apr 2025
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Chaohu Liu
Tianyi Gui
Yu Liu
Linli Xu
VLM
AAML
126
1
0
02 Apr 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
Jiawei Wang
Yushen Zuo
Yuanjun Chai
Ziqiang Liu
Yichen Fu
Yichun Feng
Kin-Man Lam
AAML
VLM
146
0
0
02 Apr 2025
PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks
Abdelrahman Elskhawy
Mengze Li
Nassir Navab
Benjamin Busam
VLM
95
1
0
01 Apr 2025
SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering
Bingxin Li
95
0
0
01 Apr 2025
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Jie Ma
Zhitao Gao
Qi Chai
Jing Liu
Peijie Wang
Jing Tao
Zhou Su
124
2
0
01 Apr 2025
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Yongbin Li
Yize Zhang
Tao Lin
Xiangrui Liu
Wenxiao Cai
Zhengyang Liang
Bo Zhao
LRM
119
9
0
31 Mar 2025
CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
Zhiqiang Wang
Pengbin Feng
Yanbin Lin
Shuzhang Cai
Zongao Bian
Jinghua Yan
Xingquan Zhu
82
4
0
31 Mar 2025
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Francesco P. Ramunno
Paolo Massa
Vitaliy Kinakh
Brandon Panos
A. Csillaghy
Slava Voloshynovskiy
DiffM
107
0
0
31 Mar 2025
Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
Zhecheng Li
Guoxian Song
Yujun Cai
Zhen Xiong
Junsong Yuan
Yiwei Wang
3DV
VLM
165
0
0
31 Mar 2025
Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization
Iñigo Pikabea
Iñaki Lacunza
Oriol Pareras
Carlos Escolano
Aitor Gonzalez-Agirre
Javier Hernando
Marta Villegas
VLM
203
1
0
28 Mar 2025
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
Antonia Karamolegkou
Malvina Nikandrou
Georgios Pantazopoulos
Danae Sanchez Villegas
Phillip Rust
Ruchira Dhar
Daniel Hershcovich
Anders Søgaard
73
0
0
28 Mar 2025
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Patrice Bechard
Chao Wang
Amirhossein Abaskohi
Juan A. Rodriguez
Christopher Pal
David Vazquez
Spandana Gella
Sai Rajeswar
Perouz Taslakian
107
0
0
27 Mar 2025
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng
Ziyuan Huang
Kaixiang Ji
Yichao Yan
VLM
240
1
0
26 Mar 2025
ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
Chenxi Wang
Jizhan Fang
Xiang Chen
Bozhong Tian
Ziwen Xu
Hong Chen
N. Zhang
KELM
143
0
0
26 Mar 2025
BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction
Jan Kohút
Martin Dočekal
Michal Hradiš
Marek Vaško
85
0
0
25 Mar 2025
OCRT: Boosting Foundation Models in the Open World with Object-Concept-Relation Triad
Luyao Tang
Yuxuan Yuan
Chen Chen
Zeyu Zhang
Yue Huang
Kun Zhang
98
1
0
24 Mar 2025
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
...
Jinghua Yan
Y. Bai
P. Sadayappan
Helen Zhou
Bo Yuan
VLM
155
2
0
24 Mar 2025
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
Shuo Yang
Siwen Luo
S. Han
Eduard Hovy
LRM
64
6
0
24 Mar 2025
A Study on Neuro-Symbolic Artificial Intelligence: Healthcare Perspectives
Delower Hossain
Jake Y Chen
NAI
85
1
0
23 Mar 2025
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Qiao Liang
Yanjiang Liu
Xianpei Han
Yaojie Lu
Hongyu Lin
Jia Zheng
Jia Zheng
Le Sun
Le Sun
Yingfei Sun
97
0
0
23 Mar 2025
A Survey on Mathematical Reasoning and Optimization with Large Language Models
Ali Forootani
OffRL
LRM
AI4CE
118
1
0
22 Mar 2025
Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models
Ketan Suhaas Saichandran
Xavier Thomas
Prakhar Kaushik
Deepti Ghadiyaram
DiffM
153
1
0
22 Mar 2025
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
Jianing Qi
Jiawei Liu
Hao Tang
Zhigang Zhu
165
4
0
21 Mar 2025
Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction
Mengsay Loem
Taiju Hosaka
85
0
0
21 Mar 2025
REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models
Jie M. Zhang
Zheng Yuan
Ziyi Wang
Bei Yan
Sibo Wang
Xiangkui Cao
Zonghui Guo
Shiguang Shan
Xilin Chen
ELM
137
0
0
20 Mar 2025
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
Haochen Zhang
Nader Zantout
Pujith Kachana
Ji Zhang
Wenshan Wang
VGen
83
0
0
20 Mar 2025
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
Han Wang
Kai Hu
Liangcai Gao
341
0
0
20 Mar 2025
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li
Jiajun Sun
Guodong Zheng
Xiaoran Fan
Yujiong Shen
...
Wenming Tan
Tao Ji
Tao Gui
Qi Zhang
Xuanjing Huang
AAML
VLM
192
1
0
19 Mar 2025
Where do Large Vision-Language Models Look at when Answering Questions?
X. Xing
Chia-Wen Kuo
Li Fuxin
Yulei Niu
Fan Chen
Ming Li
Ying Wu
Longyin Wen
Sijie Zhu
LRM
119
1
0
18 Mar 2025
Can Large Vision Language Models Read Maps Like a Human?
Shuo Xing
Zezhou Sun
Shuangyu Xie
Kaiyuan Chen
Yanjia Huang
Yuping Wang
Jiachen Li
Dezhen Song
Zhengzhong Tu
142
8
0
18 Mar 2025
ExDDV: A New Dataset for Explainable Deepfake Detection in Video
Vlad Hondru
Eduard Hogea
Darian M. Onchis
Radu Tudor Ionescu
147
2
0
18 Mar 2025
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
Xinyu Tian
Shu Zou
Zhaoyuan Yang
Jing Zhang
108
3
0
18 Mar 2025
A Survey on the Optimization of Large Language Model-based Agents
Shangheng Du
Jiabao Zhao
Jinxin Shi
Zhentao Xie
Xin Jiang
Yanhong Bai
Liang He
LLMAG
LM&Ro
LM&MA
539
5
0
16 Mar 2025
DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
Xirui Zhou
Lianlei Shan
Xiaolin Gui
91
0
0
14 Mar 2025
Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering
Kaixuan Jiang
Yang Liu
Weixing Chen
Jingzhou Luo
Ziliang Chen
Ling Pan
G. Li
Liang Lin
108
4
0
14 Mar 2025
Previous
1
2
3
4
5
6
...
58
59
60
Next