Papers
Communities
Organizations
Events
Blog
Pricing
Search
Open menu
Home
Papers
1612.00837
Cited By
v1
v2
v3 (latest)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,037 papers shown
Title
Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
Tingyu Qu
Tinne Tuytelaars
Marie-Francine Moens
MoE
90
2
0
14 Mar 2024
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Sipeng Zheng
Bohan Zhou
Yicheng Feng
Ye Wang
Zongqing Lu
VLM
MLLM
86
9
0
14 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL
MoMe
90
11
0
13 Mar 2024
An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model
Yuxin Tian
Mouxing Yang
Yunfan Li
Dayiheng Liu
Xingzhang Ren
Xiaocui Peng
Jiancheng Lv
VLM
75
0
0
13 Mar 2024
CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model
Cheng Chen
Sitong Su
Xu Luo
Hengtao Shen
Lianli Gao
Jingkuan Song
CLL
71
19
0
13 Mar 2024
Fine-tuning Large Language Models with Sequential Instructions
Hanxu Hu
Simon Yu
Pinzhen Chen
Edoardo Ponti
ALM
LRM
141
15
0
12 Mar 2024
Synth
2
^2
2
: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh
Christos Kaplanis
Shreya Pathak
D. Kumaran
Anastasija Ilić
Jovana Mitrović
Charles Blundell
Andrea Banino
VLM
107
12
0
12 Mar 2024
Multi-modal Auto-regressive Modeling via Visual Words
Tianshuo Peng
Zuchao Li
Lefei Zhang
Hai Zhao
Ping Wang
Bo Du
OffRL
57
5
0
12 Mar 2024
Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models
Minjie Zhu
Yichen Zhu
Xin Liu
Ning Liu
Zhiyuan Xu
Yaxin Peng
Chaomin Shen
Zhicai Ou
Feifei Feng
Jian Tang
VLM
102
22
0
10 Mar 2024
SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM
Jielin Qiu
Andrea Madotto
Zhaojiang Lin
Paul A. Crook
Yongjun Xu
Xin Luna Dong
Christos Faloutsos
Lei Li
Babak Damavandi
Seungwhan Moon
94
10
0
07 Mar 2024
Yi: Open Foundation Models by 01.AI
01. AI
Alex Young
01.AI Alex Young
Bei Chen
Chao Li
...
Yue Wang
Yuxuan Cai
Zhenyu Gu
Zhiyuan Liu
Zonghong Dai
OSLM
LRM
328
577
0
07 Mar 2024
CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning
Yanqi Dai
Dong Jing
Nanyi Fei
Zhiwu Lu
Nanyi Fei
Guoxing Yang
Zhiwu Lu
112
3
0
07 Mar 2024
Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning
Deepanway Ghosal
Vernon Toh Yan Han
Chia Yew Ken
Soujanya Poria
ReLM
LRM
176
12
0
06 Mar 2024
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Gen Luo
Yiyi Zhou
Yuxin Zhang
Xiawu Zheng
Xiaoshuai Sun
Rongrong Ji
VLM
89
66
0
05 Mar 2024
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
Jianjian Cao
Peng Ye
Shengze Li
Chong Yu
Yansong Tang
Jiwen Lu
Tao Chen
88
22
0
05 Mar 2024
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Weizhi Wang
Khalil Mrini
Linjie Yang
Sateesh Kumar
Yu Tian
Xifeng Yan
Heng Wang
85
17
0
05 Mar 2024
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
Imad Eddine Toubal
Aditya Avinash
N. Alldrin
Jan Dlabal
Wenlei Zhou
...
Chun-Ta Lu
Howard Zhou
Ranjay Krishna
Ariel Fuxman
Tom Duerig
VLM
147
7
0
05 Mar 2024
VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing
Zhiyuan Chang
Mingyang Li
Junjie Wang
Cheng Li
Qing Wang
60
0
0
05 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
168
12
0
05 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM
ObjD
118
36
0
04 Mar 2024
An Improved Traditional Chinese Evaluation Suite for Foundation Model
Zhi Rui Tam
Ya-Ting Pai
Yen-Wei Lee
Jun-Da Chen
Wei-Min Chu
Sega Cheng
Hong-Han Shuai
ELM
128
12
0
04 Mar 2024
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models
Lizhou Fan
Wenyue Hua
Xiang Li
Kaijie Zhu
Mingyu Jin
...
Haoyang Ling
Jinkui Chi
Jindong Wang
Xin Ma
Yongfeng Zhang
LRM
88
14
0
04 Mar 2024
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks
Fakhraddin Alwajih
El Moatez Billah Nagoudi
Gagan Bhatia
Abdelrahman Mohamed
Muhammad Abdul-Mageed
VLM
LRM
92
16
0
01 Mar 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
145
53
0
29 Feb 2024
Grounding Language Models for Visual Entity Recognition
Zilin Xiao
Ming Gong
Paola Cascante-Bonilla
Xingyao Zhang
Jie Wu
Vicente Ordonez
VLM
97
10
0
28 Feb 2024
Generative AI for Unmanned Vehicle Swarms: Challenges, Applications and Opportunities
Guangyuan Liu
Nguyen Van Huynh
Hongyang Du
D. Hoang
Dusit Niyato
Kun Zhu
Jiawen Kang
Zehui Xiong
Abbas Jamalipour
Dong In Kim
103
14
0
28 Feb 2024
All in an Aggregated Image for In-Image Learning
Lei Wang
Wanyu Xu
Zhiqiang Hu
Yihuai Lan
Shan Dong
Hao Wang
Roy Ka-wei Lee
Ee-Peng Lim
VLM
101
1
0
28 Feb 2024
Probing Multimodal Large Language Models for Global and Local Semantic Representations
Mingxu Tao
Quzhe Huang
Kun Xu
Liwei Chen
Yansong Feng
Dongyan Zhao
84
5
0
27 Feb 2024
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Yang Liu
Xiaomin Yu
Gongyu Zhang
Christos Bergeles
Prokar Dasgupta
Alejandro Granados
Sebastien Ourselin
82
2
0
27 Feb 2024
Measuring Vision-Language STEM Skills of Neural Models
Jianhao Shen
Ye Yuan
Srbuhi Mirzoyan
Ming Zhang
Chenguang Wang
VLM
143
12
0
27 Feb 2024
VCD: A Dataset for Visual Commonsense Discovery in Images
Xiangqing Shen
Yurun Song
Siwei Wu
Rui Xia
118
6
0
27 Feb 2024
GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation
Yi Zong
Xipeng Qiu
ELM
VLM
64
8
0
24 Feb 2024
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
Chaoya Jiang
Wei Ye
Mengfan Dong
Hongrui Jia
Haiyang Xu
Mingshi Yan
Ji Zhang
Shikun Zhang
VLM
MLLM
120
16
0
24 Feb 2024
Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning
Tejas Srinivasan
Jack Hessel
Tanmay Gupta
Bill Yuchen Lin
Yejin Choi
Jesse Thomason
Khyathi Chandu
73
9
0
23 Feb 2024
Multimodal Transformer With a Low-Computational-Cost Guarantee
Sungjin Park
Edward Choi
68
2
0
23 Feb 2024
CommVQA: Situating Visual Question Answering in Communicative Contexts
N. Naik
Christopher Potts
Elisa Kreiss
CoGe
44
0
0
22 Feb 2024
Uncertainty-Aware Evaluation for Vision-Language Models
Vasily Kostumov
Bulat Nutfullin
Oleg Pilipenko
Eugene Ilyushin
ELM
219
9
0
22 Feb 2024
Vision-Language Navigation with Embodied Intelligence: A Survey
Peng Gao
Peng Wang
Feng Gao
Fei Wang
Ruyue Yuan
LM&Ro
108
3
0
22 Feb 2024
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
Yunxin Li
Xinyu Chen
Baotian Hu
Haoyuan Shi
Min Zhang
85
5
0
21 Feb 2024
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Akash Ghosh
Arkadeep Acharya
Sriparna Saha
Vinija Jain
Aman Chadha
VLM
127
33
0
20 Feb 2024
ConVQG: Contrastive Visual Question Generation with Multimodal Guidance
Li Mi
Syrielle Montariol
J. Castillo-Navarro
Xianjie Dai
Antoine Bosselut
D. Tuia
59
4
0
20 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
136
36
0
20 Feb 2024
Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?
Nishant Balepur
Abhilasha Ravichander
Rachel Rudinger
ELM
134
28
0
19 Feb 2024
The Revolution of Multimodal Large Language Models: A Survey
Davide Caffagni
Federico Cocchi
Luca Barsellotti
Nicholas Moratelli
Sara Sarto
Lorenzo Baraldi
Lorenzo Baraldi
Marcella Cornia
Rita Cucchiara
LRM
VLM
139
64
0
19 Feb 2024
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann
Naman D. Singh
Francesco Croce
Matthias Hein
VLM
AAML
106
50
0
19 Feb 2024
Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion
Ziyue Wang
Chi Chen
Yiqi Zhu
Ziyue Wang
Peng Li
Ming Yan
Ji Zhang
Fei Huang
Maosong Sun
Yang Liu
104
5
0
19 Feb 2024
Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models
Didi Zhu
Zhongyi Sun
Zexi Li
Tao Shen
Ke Yan
Shouhong Ding
Kun Kuang
Chao Wu
CLL
KELM
MoMe
131
31
0
19 Feb 2024
Efficient Multimodal Learning from Data-centric Perspective
Muyang He
Yexin Liu
Boya Wu
Jianhao Yuan
Yueze Wang
Tiejun Huang
Bo Zhao
MLLM
98
88
0
18 Feb 2024
Can Large Multimodal Models Uncover Deep Semantics Behind Images?
Yixin Yang
Zheng Li
Qingxiu Dong
Heming Xia
Zhifang Sui
VLM
94
11
0
17 Feb 2024
II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering
Jihyung Kil
Farideh Tavazoee
Dongyeop Kang
Joo-Kyung Kim
LRM
71
3
0
16 Feb 2024
Previous
1
2
3
...
14
15
16
...
39
40
41
Next