Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1612.00837
Cited By
v1
v2
v3 (latest)
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
2 December 2016
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering"
50 / 2,037 papers shown
Title
Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
Te Yang
Jian Jia
Xiangyu Zhu
Weisong Zhao
Bo Wang
...
Shengyuan Liu
Quan Chen
Peng Jiang
Kun Gai
Zhen Lei
86
1
0
23 Nov 2024
freePruner: A Training-free Approach for Large Multimodal Model Acceleration
Bingxin Xu
Yuzhang Shang
Yunhao Ge
Qian Lou
Yan Yan
146
3
0
23 Nov 2024
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
Qizhou Chen
Chengyu Wang
Dakan Wang
Taolin Zhang
Wangyue Li
Xiaofeng He
KELM
161
1
0
23 Nov 2024
Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts
Honglin Li
Yuting Gao
Chenglu Zhu
Jingdong Chen
M. Yang
Lin Yang
MLLM
205
0
0
21 Nov 2024
MMGenBench: Fully Automatically Evaluating LMMs from the Text-to-Image Generation Perspective
Hailang Huang
Yong Wang
Zixuan Huang
Huaqiu Li
Tongwen Huang
Xiangxiang Chu
Richong Zhang
MLLM
LM&MA
EGVM
138
1
0
21 Nov 2024
Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models
Zhen Zeng
Leijiang Gu
Xun Yang
Zhangling Duan
Zenglin Shi
Meng Wang
KELM
134
2
0
19 Nov 2024
A Comprehensive Survey on Visual Question Answering Datasets and Algorithms
Raihan Kabir
Naznin Haque
Md. Saiful Islam
Marium-E. Jannat
CoGe
91
1
0
17 Nov 2024
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Weiyun Wang
Zhe Chen
Wenhai Wang
Yue Cao
Yangzhou Liu
...
Jinguo Zhu
X. Zhu
Lewei Lu
Yu Qiao
Jifeng Dai
LRM
163
93
1
15 Nov 2024
MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Jianhong Tu
Zhuohao Ni
Nicholas Crispino
Zihao Yu
Michael Bendersky
...
Ruoxi Jia
Xin Liu
Lingjuan Lyu
Dawn Song
Chenguang Wang
VLM
MLLM
116
0
0
15 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Hao Sun
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
74
4
0
14 Nov 2024
Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models
Xiao Liu
Lijun Zhang
Deepak Ganesan
Hui Guan
VLM
102
0
0
08 Nov 2024
TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models
Jonathan Fhima
Elad Ben Avraham
Oren Nuriel
Yair Kittenplon
Roy Ganz
Aviad Aberdam
Ron Litman
VLM
74
1
0
07 Nov 2024
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
Ziliang Gan
Yu Lu
D. Zhang
Haohan Li
Che Liu
...
Haipang Wu
Chaoyou Fu
Z. Xu
Rongjunchen Zhang
Yong Dai
113
13
0
05 Nov 2024
Classification Done Right for Vision-Language Pre-Training
Zilong Huang
Qinghao Ye
Bingyi Kang
Jiashi Feng
Haoqi Fan
CLIP
VLM
142
4
0
05 Nov 2024
HumanVLM: Foundation for Human-Scene Vision-Language Model
Dawei Dai
Xu Long
Li Yutang
Zhang YuanHui
Shuyin Xia
VLM
MLLM
154
2
0
05 Nov 2024
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
Yangning Li
Hai-Tao Zheng
Xinyu Wang
Yong Jiang
Zhen Zhang
...
Hui Wang
Hai-Tao Zheng
Pengjun Xie
Philip S. Yu
Fei Huang
164
23
0
05 Nov 2024
One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering
Deepayan Das
Davide Talon
Massimiliano Mancini
Yiming Wang
Elisa Ricci
130
0
0
04 Nov 2024
Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models
Junjiao Tian
Chengyue Huang
Z. Kira
65
2
0
03 Nov 2024
Right this way: Can VLMs Guide Us to See More to Answer Questions?
Li Liu
Diji Yang
Sijia Zhong
Kalyana Suma Sree Tholeti
Lei Ding
Yi Zhang
Leilani H. Gilpin
134
3
0
01 Nov 2024
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Chen Huang
Skyler Seto
Samira Abnar
David Grangier
Navdeep Jaitly
J. Susskind
VLM
80
1
0
31 Oct 2024
TurtleBench: A Visual Programming Benchmark in Turtle Geometry
Sina Rismanchian
Yasaman Razeghi
Sameer Singh
Shayan Doroudi
142
2
0
31 Oct 2024
Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model
Keito Sasagawa
Koki Maeda
Issa Sugiura
Shuhei Kurita
Naoaki Okazaki
Daisuke Kawahara
VLM
54
1
0
30 Oct 2024
SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset
Ngoc Dung Huynh
Mohamed Reda Bouadjenek
Sunil Aryal
Imran Razzak
Hakim Hacid
97
0
0
30 Oct 2024
Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data
Badr AlKhamissi
Yingtian Tang
Abdülkadir Gökce
Johannes Mehrer
Martin Schrimpf
VLM
104
0
0
29 Oct 2024
Improving Generalization in Visual Reasoning via Self-Ensemble
Tien-Huy Nguyen
Quang-Khai Tran
Anh-Tuan Quang-Hoang
VLM
LRM
131
6
0
28 Oct 2024
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
Han Bao
Yue Huang
Yanbo Wang
Jiayi Ye
Xiangqi Wang
Preslav Nakov
Mohamed Elhoseiny
Wei Wei
Mohamed Elhoseiny
Xiangliang Zhang
109
11
0
28 Oct 2024
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
L. Qin
Qiguang Chen
Hao Fei
Zhi Chen
Min Li
Wanxiang Che
91
11
0
27 Oct 2024
Improving Multimodal Large Language Models Using Continual Learning
Shikhar Srivastava
Md Yousuf Harun
Robik Shrestha
Christopher Kanan
KELM
VLM
CLL
75
1
0
25 Oct 2024
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training
Haocheng Xi
Han Cai
Ligeng Zhu
Yaojie Lu
Kurt Keutzer
Jianfei Chen
Song Han
MQ
184
11
0
25 Oct 2024
Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant
A. S. Penamakuri
Anand Mishra
120
1
0
24 Oct 2024
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Sara Ghaboura
Ahmed Heakl
Omkar Thawakar
Ali Alharthi
Ines Riahi
Abduljalil Saif
Jorma T. Laaksonen
Fahad Shahbaz Khan
Salman Khan
Rao Muhammad Anwer
89
3
0
24 Oct 2024
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Zijia Zhao
Longteng Guo
Tongtian Yue
Erdong Hu
Shuai Shao
Zehuan Yuan
Hua Huang
Qingbin Liu
53
3
0
24 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
102
0
0
23 Oct 2024
CLEAR: Character Unlearning in Textual and Visual Modalities
Alexey Dontsov
Dmitrii Korzh
Alexey Zhavoronkin
Boris Mikheev
Denis Bobkov
Aibek Alanov
Oleg Y. Rogov
Ivan Oseledets
Elena Tutubalina
MU
AILaw
VLM
193
5
0
23 Oct 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing
Qidong Huang
Xiaoyi Dong
Jiajie Lu
Pan Zhang
...
Yuhang Cao
Zeang Sheng
Jiaqi Wang
Feng Wu
Dahua Lin
VLM
140
46
0
22 Oct 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Han Huang
Yuqi Huo
Zijia Zhao
Haoyu Lu
Shu Wu
Bin Wang
Qiang Liu
Weipeng Chen
Liang Wang
VLM
67
1
0
21 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
108
1
0
21 Oct 2024
Reducing Hallucinations in Vision-Language Models via Latent Space Steering
Sheng Liu
Haotian Ye
Lei Xing
James Zou
VLM
LLMSV
170
9
0
21 Oct 2024
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Y. Cai
Jiangning Zhang
Haoyang He
Xinwei He
Ao Tong
Zhenye Gan
Chengjie Wang
Zhucun Xue
Yong-Jin Liu
X. Bai
VLM
98
6
0
21 Oct 2024
Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training
Rohan Saha
Abrar Fahim
Alona Fyshe
Alex Murphy
55
0
0
20 Oct 2024
ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
Deeparghya Dutta Barua
Md Sakib Ul Rahman Sourove
Md Fahim
Fabiha Haider
Fariha Tanjim Shifat
Md Tasmim Rahman Adib
Anam Borhan Uddin
Md Farhan Ishmam
Md Farhad Alam
93
0
0
19 Oct 2024
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension
Yin Xie
Kaicheng Yang
Ninghua Yang
Weimo Deng
Xiangzi Dai
...
Yumeng Wang
Xiang An
Yongle Zhao
Ziyong Feng
Jiankang Deng
MLLM
VLM
77
1
0
18 Oct 2024
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
Muhe Ding
Yang Ma
Pengda Qin
Jianlong Wu
Yuhong Li
Liqiang Nie
80
1
0
18 Oct 2024
ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
Nghia Hieu Nguyen
Tho Thanh Quan
Ngan Luu-Thuy Nguyen
75
0
0
18 Oct 2024
MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps
Xiongtao Zhou
Jie He
Lanyu Chen
Jingyu Li
Haojing Chen
Víctor Gutiérrez-Basulto
Jeff Z. Pan
Ningyu Zhang
LRM
197
2
0
18 Oct 2024
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
Zifeng Zhu
Mengzhao Jia
Zizhuo Zhang
Lang Li
Meng Jiang
LRM
146
5
0
18 Oct 2024
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models
Olga Loginova
Oleksandr Bezrukov
Ravi Shekhar
Alexey Kravets
68
2
0
18 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAML
CoGe
VLM
259
31
0
18 Oct 2024
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
Yuxin Wen
Qingqing Cao
Qichen Fu
Sachin Mehta
Mahyar Najibi
VLM
127
5
0
17 Oct 2024
γ
−
γ-
γ
−
MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Yaxin Luo
Gen Luo
Jiayi Ji
Yiyi Zhou
Xiaoshuai Sun
Zhiqiang Shen
Rongrong Ji
VLM
MoE
100
1
0
17 Oct 2024
Previous
1
2
3
...
6
7
8
...
39
40
41
Next