Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2401.17981
Cited By
Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study
31 January 2024
Qirui Jiao
Daoyuan Chen
Yilun Huang
Yaliang Li
Ying Shen
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study"
12 / 12 papers shown
Title
VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding
Jiaqi Wang
Yifei Gao
Jitao Sang
MLLM
147
2
0
24 Nov 2024
What matters when building vision-language models?
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
58
166
0
03 May 2024
Hallucination of Multimodal Large Language Models: A Survey
Zechen Bai
Pichao Wang
Tianjun Xiao
Tong He
Zongbo Han
Zheng Zhang
Mike Zheng Shou
VLM
LRM
116
167
0
29 Apr 2024
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Gen Luo
Yiyi Zhou
Tianhe Ren
Shen Chen
Xiaoshuai Sun
Rongrong Ji
VLM
MLLM
36
93
0
24 May 2023
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
Dustin Schwenk
Apoorv Khandelwal
Christopher Clark
Kenneth Marino
Roozbeh Mottaghi
32
524
0
03 Jun 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
228
3,458
0
29 Apr 2022
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
Hao Zhang
Feng Li
Shilong Liu
Lei Zhang
Hang Su
Jun Zhu
L. Ni
H. Shum
ViT
115
1,399
0
07 Mar 2022
PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System
Yuning Du
Chenxia Li
Ruoyu Guo
Cheng Cui
Weiwei Liu
...
Yehua Yang
Qiwen Liu
Xiaoguang Hu
Dianhai Yu
Yanjun Ma
25
66
0
07 Sep 2021
TextCaps: a Dataset for Image Captioning with Reading Comprehension
Oleksii Sidorov
Ronghang Hu
Marcus Rohrbach
Amanpreet Singh
39
406
0
24 Mar 2020
Towards VQA Models That Can Read
Amanpreet Singh
Vivek Natarajan
Meet Shah
Yu Jiang
Xinlei Chen
Dhruv Batra
Devi Parikh
Marcus Rohrbach
EgoV
35
1,174
0
18 Apr 2019
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal
Tejas Khot
D. Summers-Stay
Dhruv Batra
Devi Parikh
CoGe
276
3,187
0
02 Dec 2016
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Julia Hockenmaier
Svetlana Lazebnik
142
2,033
0
19 May 2015
1