Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2008.11976
Cited By
Visual Question Answering on Image Sets
27 August 2020
Ankan Bansal
Yuting Zhang
Rama Chellappa
CoGe
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Visual Question Answering on Image Sets"
20 / 20 papers shown
Title
ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task
Vittorio Pippi
Matthieu Guillaumin
S. Cascianelli
Rita Cucchiara
M. Jaritz
Loris Bazzani
64
0
0
06 Mar 2025
Natural Language Generation from Visual Sequences: Challenges and Future Directions
Aditya K Surikuchi
Raquel Fernández
Sandro Pezzelle
EGVM
231
0
0
18 Feb 2025
Generative Landmarks Guided Eyeglasses Removal 3D Face Reconstruction
Dapeng Zhao
Yue Qi
3DH
CVBM
3DV
37
6
0
31 Dec 2024
Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination
Haojie Zheng
Tianyang Xu
Hanchi Sun
Shu Pu
Ruoxi Chen
Lichao Sun
MLLM
LRM
84
8
0
15 Nov 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM
SyDa
VLM
58
569
0
06 Aug 2024
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
Tsung-Han Wu
Giscard Biamby
Jerome Quenum
Ritwik Gupta
Joseph E. Gonzalez
Trevor Darrell
David M. Chan
VLM
49
7
0
18 Jul 2024
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li
Renrui Zhang
Hao Zhang
Yuanhan Zhang
Bo Li
Wei Li
Zejun Ma
Chunyuan Li
MLLM
VLM
52
198
0
10 Jul 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang
Xingyu Fu
James Y. Huang
Zekun Li
Qin Liu
...
Kai-Wei Chang
Dan Roth
Sheng Zhang
Hoifung Poon
Muhao Chen
VLM
50
47
0
13 Jun 2024
Step Differences in Instructional Video
Tushar Nagarajan
Lorenzo Torresani
VGen
32
5
0
24 Apr 2024
JDocQA: Japanese Document Question Answering Dataset for Generative Language Models
Eri Onami
Shuhei Kurita
Taiki Miyanishi
Taro Watanabe
27
1
0
28 Mar 2024
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Chenyu You
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
41
45
0
30 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
43
36
0
01 Nov 2023
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
Juncheng Li
Kaihang Pan
Zhiqi Ge
Minghe Gao
Wei Ji
Wenqiao Zhang
Tat-Seng Chua
Siliang Tang
Hanwang Zhang
Yueting Zhuang
MLLM
35
68
0
08 Aug 2023
Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering
A. S. Penamakuri
Manish Gupta
Mithun Das Gupta
Anand Mishra
37
7
0
29 Jun 2023
SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images
Ryota Tanaka
Kyosuke Nishida
Kosuke Nishida
Taku Hasegawa
Itsumi Saito
Kuniko Saito
25
72
0
12 Jan 2023
Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning
Hui Li
Mingjie Sun
Jimin Xiao
Eng Gee Lim
Yao-Min Zhao
29
20
0
17 Dec 2022
Image Retrieval from Contextual Descriptions
Benno Krojer
Vaibhav Adlakha
Vibhav Vineet
Yash Goyal
E. Ponti
Siva Reddy
19
29
0
29 Mar 2022
3D Question Answering
Shuquan Ye
Dongdong Chen
Songfang Han
Jing Liao
ViT
26
46
0
15 Dec 2021
Zero-Shot Language Transfer vs Iterative Back Translation for Unsupervised Machine Translation
Aviral Joshi
Chengzhi Huang
H. Singh
19
2
0
31 Mar 2021
Iterative Shrinking for Referring Expression Grounding Using Deep Reinforcement Learning
Mingjie Sun
Jimin Xiao
Eng Gee Lim
ObjD
22
33
0
09 Mar 2021
1