VQA: Visual Question Answering
3 May 2015
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. L. Zitnick, Dhruv Batra, Devi Parikh

Papers citing "VQA: Visual Question Answering" (50 of 2,957 shown)

Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?
Antonia Wüst, Tim Nelson Tobiasch, Lukas Helff, Inga Ibs, Wolfgang Stammer, Devendra Singh Dhami, Constantin Rothkopf, Kristian Kersting
25 Oct 2024 · CoGe, ReLM, VLM, LRM

Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant
A. S. Penamakuri, Anand Mishra
24 Oct 2024

Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data
Xinyi Ling, Bo Peng, Hanwen Du, Zhihui Zhu, Xia Ning
22 Oct 2024

AttriPrompter: Auto-Prompting with Attribute Semantics for Zero-shot Nuclei Detection via Visual-Language Pre-trained Models
Yongjian Wu, Yang Zhou, Jiya Saiyin, Bingzheng Wei, M. Lai, Jianzhong Shou, Yan Xu
22 Oct 2024 · VLM, MedIm

JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation
Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Kazuki Egashira, Jeonghun Baek, Xiang Yue, Graham Neubig, Kiyoharu Aizawa
22 Oct 2024 · OSLM

ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
Deeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Fahim, Fabiha Haider, Fariha Tanjim Shifat, Md Tasmim Rahman Adib, Anam Borhan Uddin, Md Farhan Ishmam, Md Farhad Alam
19 Oct 2024

RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
Muhe Ding, Yang Ma, Pengda Qin, Jianlong Wu, Yuhong Li, Liqiang Nie
18 Oct 2024

ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
Nghia Hieu Nguyen, Tho Thanh Quan, Ngan Luu-Thuy Nguyen
18 Oct 2024

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
18 Oct 2024 · AAML, CoGe, VLM

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, H. Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
17 Oct 2024 · MLLM

Can MLLMs Understand the Deep Implication Behind Chinese Images?
Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, ..., Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
17 Oct 2024 · ELM, VLM

VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks
Shailaja Keyur Sampat, Mutsumi Nakamura, Shankar Kailas, Kartik Aggarwal, Mandy Zhou, Yezhou Yang, Chitta Baral
17 Oct 2024 · MLLM, CoGe, ReLM, VLM, LRM

Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?
Shailaja Keyur Sampat, Maitreya Patel, Yezhou Yang, Chitta Baral
17 Oct 2024

RescueADI: Adaptive Disaster Interpretation in Remote Sensing Images with Autonomous Agents
Zhuoran Liu, Danpei Zhao, Bo Yuan
17 Oct 2024

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
16 Oct 2024

Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs
Sihang Zhao, Youliang Yuan, Xiaoying Tang, Pinjia He
15 Oct 2024

When Does Perceptual Alignment Benefit Vision Representations?
Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Y. Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, Phillip Isola
14 Oct 2024

Eliminating the Language Bias for Visual Question Answering with fine-grained Causal Intervention
Ying Liu, Ge Bai, Chenji Lu, Shilong Li, Zhang Zhang, Ruifang Liu, Wenbin Guo
14 Oct 2024

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang, Dacheng Yin, Yizhou Zhou, Fengyun Rao, Wei-dong Zhai, Yang Cao, Zheng-jun Zha
14 Oct 2024 · DiffM

Can We Predict Performance of Large Models across Vision-Language Tasks?
Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
14 Oct 2024

Locality Alignment Improves Vision-Language Models
Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto
14 Oct 2024 · VLM

Leveraging Customer Feedback for Multi-modal Insight Extraction
Sandeep Sricharan Mukku, Abinesh Kanagarajan, Pushpendu Ghosh, Chetan Aggarwal
13 Oct 2024

ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos
Arpan Phukan, Manish Gupta, Asif Ekbal
13 Oct 2024 · VGen

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
13 Oct 2024 · VLM, CoGe

Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets
Thomas Eiter, Jan Hadl, N. Higuera, J. Oetsch
12 Oct 2024

Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs
Xiaoyuan Liu, Wenxuan Wang, Youliang Yuan, Jen-tse Huang, Qiuzhi Liu, Pinjia He, Zhaopeng Tu
10 Oct 2024

COMMA: A Communicative Multimodal Multi-Agent Benchmark
Timothy Ossowski, Jixuan Chen, Danyal Maqbool, Zefan Cai, Tyler Bradshaw, Junjie Hu
10 Oct 2024 · VLM

Chain-of-Sketch: Enabling Global Visual Reasoning
Aryo Lotfi, Enrico Fini, Samy Bengio, Moin Nabi, Emmanuel Abbe
10 Oct 2024 · LRM

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet
Haoran Zhang, Hangyu Guo, Shuyue Guo, Meng Cao, Wenhao Huang, Jiaheng Liu, Ge Zhang
09 Oct 2024 · VLM, MLLM, LRM

ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments
Sourjyadip Ray, Kushal Gupta, Soumi Kundu, Payal Arvind Kasat, Somak Aditya, Pawan Goyal
08 Oct 2024

Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects
Wenhao Li, Yudong Xu, Scott Sanner, Elias Boutros Khalil
08 Oct 2024 · ViT

EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment
Yifei Xing, Xiangyuan Lan, Ruiping Wang, D. Jiang, Wenjun Huang, Qingfang Zheng, Yaowei Wang
08 Oct 2024 · Mamba

Core Tokensets for Data-efficient Sequential Training of Transformers
Subarnaduti Paul, Manuel Brack, P. Schramowski, Kristian Kersting, Martin Mundt
08 Oct 2024

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
Ziyue Wang, Chi Chen, Yurui Dong, Yuanchi Zhang, Yuzhuang Xu, Xiaolong Wang, Ziwei Sun, Yang Liu
07 Oct 2024 · LRM

MM-R³: On (In-)Consistency of Vision-Language Models (VLMs)
Shih-Han Chou, Shivam Chandhok, James J. Little, Leonid Sigal
07 Oct 2024

Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress
Christopher Agia, Rohan Sinha, Jingyun Yang, Zi-ang Cao, Rika Antonova, Marco Pavone, Jeannette Bohg
06 Oct 2024

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, Chitta Baral
06 Oct 2024 · ReLM, LRM

MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration
Lai Wei, Wenkai Wang, Xiaoyu Shen, Yu Xie, Zhihao Fan, Xiaojin Zhang, Zhongyu Wei, Wei Chen
06 Oct 2024

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning
Minheng Ni, Yutao Fan, Lei Zhang, Wangmeng Zuo
04 Oct 2024 · LRM, AI4CE

Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models
Yufang Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, Aimin Zhou
04 Oct 2024 · VLM, MLLM

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, Lifu Huang
04 Oct 2024

NL-Eye: Abductive NLI for Images
Mor Ventura, Michael Toker, Nitay Calderon, Zorik Gekhman, Yonatan Bitton, Roi Reichart
03 Oct 2024

BadCM: Invisible Backdoor Attack Against Cross-Modal Learning
Zheng Zhang, Xu Yuan, Lei Zhu, Jingkuan Song, Liqiang Nie
03 Oct 2024 · AAML

Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities
Kenza Amara, Lukas Klein, Carsten T. Lüth, Paul Jäger, Hendrik Strobelt, Mennatallah El-Assady
02 Oct 2024

Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations
Minoh Jeong, Min Namgung, Zae Myung Kim, Dongyeop Kang, Yao-Yi Chiang, Alfred Hero
02 Oct 2024

Backdooring Vision-Language Models with Out-Of-Distribution Data
Weimin Lyu, Jiachen Yao, Saumya Gupta, Lu Pang, Tao Sun, Lingjie Yi, Lijie Hu, Haibin Ling, Chao Chen
02 Oct 2024 · VLM, AAML

LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models
Zhenyue Qin, Yu Yin, Dylan Campbell, Xuansheng Wu, Ke Zou, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen
02 Oct 2024

Visual Question Decomposition on Multimodal Large Language Models
Haowei Zhang, Jianzhe Liu, Zhen Han, Shuo Chen, Bailan He, Volker Tresp, Zhiqiang Xu, Jindong Gu
28 Sep 2024

SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement
Ishani Mondal, Zongxia Li, Yufang Hou, Anandhavelu Natarajan, Aparna Garimella, Jordan Boyd-Graber
28 Sep 2024

TrojVLM: Backdoor Attack Against Vision Language Models
Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, Chao Chen
28 Sep 2024 · MLLM