1612.00837
Cited By
v1
v2
v3 (latest)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,037 papers shown
Title
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
Wentao Yuan
Jiafei Duan
Valts Blukis
Wilbert Pumacay
Ranjay Krishna
Adithyavairavan Murali
Arsalan Mousavian
Dieter Fox
LM&Ro
116
67
0
15 Jun 2024
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Wei Chen
Lin Li
Yongqi Yang
Bin Wen
Fan Yang
Tingting Gao
Yu Wu
Long Chen
VLM
VGen
150
11
0
15 Jun 2024
ClimateIQA: A New Dataset and Benchmark to Advance Vision-Language Models in Meteorology Anomalies Analysis
Jian Chen
Peilin Zhou
Yining Hua
Dading Chong
Meng Cao
Yaowei Li
Wei Chen
Bing Zhu
Junwei Liang
Zixuan Yuan
VLM
110
3
0
14 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM
LRM
92
1
0
13 Jun 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang
Xingyu Fu
James Y. Huang
Zekun Li
Qin Liu
...
Kai-Wei Chang
Dan Roth
Sheng Zhang
Hoifung Poon
Muhao Chen
VLM
139
59
0
13 Jun 2024
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim
Karl Pertsch
Siddharth Karamcheti
Ted Xiao
Ashwin Balakrishna
...
Russ Tedrake
Dorsa Sadigh
Sergey Levine
Percy Liang
Chelsea Finn
LM&Ro
VLM
296
535
0
13 Jun 2024
Comparison Visual Instruction Tuning
Wei Lin
M. Jehanzeb Mirza
Sivan Doveh
Rogerio Feris
Raja Giryes
Sepp Hochreiter
Leonid Karlinsky
98
4
0
13 Jun 2024
ReMI: A Dataset for Reasoning with Multiple Images
Mehran Kazemi
Nishanth Dikkala
Ankit Anand
Petar Dević
Ishita Dasgupta
...
Bahare Fatemi
Pranjal Awasthi
Dee Guo
Sreenivas Gollapudi
Ahmed Qureshi
LRM
VLM
114
17
0
13 Jun 2024
MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning
Hanqing Wang
Zeguan Xiao
Shuo Wang
Guanhua Chen
122
28
0
13 Jun 2024
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
Yuhang Wu
Wenmeng Yu
Yean Cheng
Yan Wang
Xiaohan Zhang
Jiazheng Xu
Ming Ding
Yuxiao Dong
108
2
0
13 Jun 2024
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases
Rithesh Murthy
Liangwei Yang
Juntao Tan
Tulika Awalgaonkar
Yilun Zhou
...
Zuxin Liu
Ming Zhu
Huan Wang
Caiming Xiong
Silvio Savarese
113
6
0
12 Jun 2024
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Yi-Fan Zhang
Qingsong Wen
Chaoyou Fu
Xue Wang
Zhang Zhang
Liwen Wang
Rong Jin
135
46
0
12 Jun 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li
Zhe Chen
Weiyun Wang
Wenhai Wang
Shenglong Ye
...
Dahua Lin
Yu Qiao
Botian Shi
Conghui He
Jifeng Dai
VLM
OffRL
122
27
0
12 Jun 2024
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
Shimin Chen
Yitian Yuan
Shaoxiang Chen
Zequn Jie
Lin Ma
VLM
91
4
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM
CLIP
87
6
0
11 Jun 2024
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Tianle Gu
Zeyang Zhou
Kexin Huang
Dandan Liang
Yixu Wang
...
Keqing Wang
Yujiu Yang
Yan Teng
Yu Qiao
Yingchun Wang
ELM
91
19
0
11 Jun 2024
Needle In A Multimodal Haystack
Weiyun Wang
Shuibo Zhang
Yiming Ren
Yuchen Duan
Tiantong Li
...
Ping Luo
Yu Qiao
Jifeng Dai
Wenqi Shao
Wenhai Wang
VLM
121
24
0
11 Jun 2024
BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models
Wanaiu Huang
66
2
0
10 Jun 2024
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
David Romero
Chenyang Lyu
Haryo Akbarianto Wibowo
Teresa Lynn
Injy Hamed
...
Oana Ignat
Joan Nwatu
Rada Mihalcea
Thamar Solorio
Alham Fikri Aji
117
43
0
10 Jun 2024
VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text
Tianyu Zhang
Suyuchen Wang
Lu Li
Ge Zhang
Perouz Taslakian
Sai Rajeswar
Jie Fu
Bang Liu
Yoshua Bengio
116
5
0
10 Jun 2024
M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark
Wei Song
Yadong Li
Jianhua Xu
Guowei Wu
Lingfeng Ming
...
Weihua Luo
Houyi Li
Yi Du
Fangda Guo
Kaicheng Yu
ELM
LRM
75
8
0
08 Jun 2024
An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models
Xiongtao Zhou
Jie He
Yuhua Ke
Guangyao Zhu
Víctor Gutiérrez-Basulto
Jeff Z. Pan
98
14
0
07 Jun 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
MLLM
174
37
0
07 Jun 2024
RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation
Jiaming Liu
Mengzhen Liu
Zhenyu Wang
Lily Lee
Kaichen Zhou
Pengju An
Senqiao Yang
Renrui Zhang
Yandong Guo
Shanghang Zhang
LM&Ro
LRM
Mamba
115
19
0
06 Jun 2024
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
Lingchen Meng
Jianwei Yang
Rui Tian
Xiyang Dai
Zuxuan Wu
Jianfeng Gao
Yu-Gang Jiang
VLM
95
9
0
06 Jun 2024
Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment
Wenliang Zhong
Wenyi Wu
Qi Li
Rob Barton
Boxin Du
Shioulin Sam
Karim Bouyarmane
Ismail B. Tutar
Junzhou Huang
94
3
0
05 Jun 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
Alex Jinpeng Wang
Linjie Li
Yiqi Lin
Min Li
Lijuan Wang
Mike Zheng Shou
VLM
109
5
0
04 Jun 2024
Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering
Yujin Baek
Koanho Lee
Hyesu Lim
Jaeseok Kim
Junmo Park
Yu-Jung Heo
Du-Seong Chang
Jaegul Choo
45
3
0
04 Jun 2024
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model
An-Chieh Cheng
Hongxu Yin
Yang Fu
Qiushan Guo
Ruihan Yang
Jan Kautz
Xiaolong Wang
Sifei Liu
LRM
122
75
0
03 Jun 2024
Selectively Answering Visual Questions
Julian Martin Eisenschlos
Hernán Maina
Guido Ivetta
Luciana Benotti
90
0
0
03 Jun 2024
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Linli Yao
Lei Li
Shuhuai Ren
Lean Wang
Yuanxin Liu
Xu Sun
Lu Hou
81
34
0
31 May 2024
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization
Richard Luo
Austin Peng
Adithya Vasudev
Rishabh Jain
46
2
0
31 May 2024
Visual Perception by Large Language Model's Weights
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
74
8
0
30 May 2024
Instruction-Guided Visual Masking
Jinliang Zheng
Jianxiong Li
Si Cheng
Yinan Zheng
Jiaming Li
Jihao Liu
Yu Liu
Jingjing Liu
Xianyuan Zhan
141
7
0
30 May 2024
Evaluating Vision-Language Models on Bistable Images
Artemis Panagopoulou
Coby Melkin
Chris Callison-Burch
73
0
0
29 May 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Ming-Yu Liu
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM
VLM
86
35
0
29 May 2024
Matryoshka Query Transformer for Large Vision-Language Models
Wenbo Hu
Zi-Yi Dou
Liunian Harold Li
Amita Kamath
Nanyun Peng
Kai-Wei Chang
MLLM
119
10
0
29 May 2024
Descriptive Image Quality Assessment in the Wild
Zhiyuan You
Jinjin Gu
Zheyuan Li
Xin Cai
Kaiwen Zhu
Chao Dong
Tianfan Xue
EGVM
93
22
0
29 May 2024
Why are Visually-Grounded Language Models Bad at Image Classification?
Yuhui Zhang
Alyssa Unell
Xiaohan Wang
Dhruba Ghosh
Yuchang Su
Ludwig Schmidt
Serena Yeung-Levy
VLM
101
37
0
28 May 2024
Dataset Growth
Ziheng Qin
Zhaopan Xu
Yukun Zhou
Zangwei Zheng
Zebang Cheng
...
Xiaojiang Peng
Radu Timofte
Hongxun Yao
Kai Wang
Yang You
DD
53
2
0
28 May 2024
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Aman Chadha
Eugenio Culurciello
114
18
0
28 May 2024
Cross-Modal Safety Alignment: Is textual unlearning all you need?
Trishna Chakraborty
Erfan Shayegani
Zikui Cai
Nael B. Abu-Ghazaleh
M. Salman Asif
Yue Dong
Amit K. Roy-Chowdhury
Chengyu Song
88
18
0
27 May 2024
Matryoshka Multimodal Models
Mu Cai
Jianwei Yang
Jianfeng Gao
Yong Jae Lee
VLM
121
33
0
27 May 2024
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
Mustafa Shukor
Matthieu Cord
146
5
0
26 May 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
152
45
0
26 May 2024
Accelerating Transformers with Spectrum-Preserving Token Merging
Hoai-Chau Tran
D. M. Nguyen
Duy M. Nguyen
Trung Thanh Nguyen
Ngan Le
Pengtao Xie
Daniel Sonntag
James Y. Zou
Binh T. Nguyen
Mathias Niepert
110
13
0
25 May 2024
Streaming Long Video Understanding with Large Language Models
Rui Qian
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Shuangrui Ding
Dahua Lin
Jiaqi Wang
VLM
142
49
0
25 May 2024
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models
Yue Zhang
Hehe Fan
Yi Yang
100
3
0
24 May 2024
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
Run Luo
Yunshui Li
Longze Chen
Wanwei He
Ting-En Lin
...
Zikai Song
Xiaobo Xia
Tongliang Liu
Min Yang
Binyuan Hui
VLM
DiffM
196
22
0
24 May 2024
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
Hongyu Wang
Jiayu Xu
Senwei Xie
Ruiping Wang
Jialin Li
Zhaojie Xie
Bin Zhang
Chuyan Xiong
Xilin Chen
ELM
VLM
LRM
168
6
0
24 May 2024