2407.01449
Cited By
v5 (latest)
ColPali: Efficient Document Retrieval with Vision Language Models
27 June 2024
Manuel Faysse
Hugues Sibille
Tony Wu
Bilel Omrani
Gautier Viaud
Céline Hudelot
Pierre Colombo
VLM
Papers citing
"ColPali: Efficient Document Retrieval with Vision Language Models"
50 / 68 papers shown
Title
MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval
Mingjun Xu
Jinhan Dong
Jue Hou
Zehui Wang
Sihang Li
Zhifeng Gao
Renxin Zhong
Hengxing Cai
AI4TS
LRM
21
0
0
14 Jun 2025
MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin
Xudong Xie
Zhang Li
Xiang Bai
Yuliang Liu
LRM
112
0
0
12 Jun 2025
MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems
Peiru Yang
Jinhua Yin
Haoran Zheng
Xueying Bai
Huili Wang
Yufei Sun
Xintian Li
Shangguang Wang
Yongfeng Huang
Tao Qi
AAML
15
0
0
09 Jun 2025
CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval
David Wan
Han Wang
Elias Stengel-Eskin
Jaemin Cho
Mohit Bansal
VLM
Presented at ResearchTrend Connect | VLM on 02 Jul 2025
43
0
0
06 Jun 2025
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Qiuchen Wang
Ruixue Ding
Y. Zeng
Zehui Chen
Lin Yen-Chen
Shihang Wang
Pengjun Xie
Fei Huang
Feng Zhao
VLM
LRM
81
0
0
28 May 2025
Multimodal RAG-driven Anomaly Detection and Classification in Laser Powder Bed Fusion using Large Language Models
Kiarash Naghavi Khanghah
Zhiling Chen
Lela Romeo
Qian Yang
Rajiv Malhotra
Farhad Imani
Hongyi Xu
132
0
0
20 May 2025
Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning
François Role
Sébastien Meyer
Victor Amblard
VLM
134
0
0
06 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
251
7
0
26 Apr 2025
ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring
Kaili Huang
Thejas Venkatesh
Uma Dingankar
Antonio Mallia
Daniel Campos
...
Matei A. Zaharia
Kwabena Boahen
Omar Khattab
Saarthak Sarup
Keshav Santhanam
112
0
0
21 Apr 2025
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Ryota Tanaka
Taichi Iki
Taku Hasegawa
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
116
6
0
14 Apr 2025
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
...
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf
VLM
99
16
0
07 Apr 2025
One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Ezzeldin Shereen
Dan Ristea
Burak Hasircioglu
Shae McFadden
V. Mavroudis
Chris Hicks
178
0
0
02 Apr 2025
Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook
Xu Zheng
Ziqiao Weng
Yuanhuiyi Lyu
Lutao Jiang
Haiwei Xue
Bin Ren
Danda Pani Paudel
N. Sebe
Luc Van Gool
Xuming Hu
3DV
143
10
0
23 Mar 2025
Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data
Anatole Callies
Quentin Bodinier
Philippe Ravaud
Kourosh Davarpanah
94
0
0
19 Mar 2025
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
S. Han
Peng Xia
Ruiyi Zhang
Tong Sun
Yun Li
Hongtu Zhu
Huaxiu Yao
VLM
186
8
0
18 Mar 2025
VisTW: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan
Zhi Rui Tam
Ya-Ting Pai
Yen-Wei Lee
Yun-Nung Chen
CoGe
173
0
0
13 Mar 2025
Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models
Jonathan Bourne
194
0
0
24 Feb 2025
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Mohammad Mahdi Abootorabi
Amirhosein Zobeiri
Mahdi Dehghani
Mohammadali Mohammadkhani
Bardia Mohammadi
Omid Ghahroodi
M. Baghshah
Ehsaneddin Asgari
RALM
348
7
0
12 Feb 2025
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Xubin Ren
Lingrui Xu
Long Xia
Shuaiqiang Wang
Dawei Yin
Chao Huang
VGen
VLM
166
5
0
03 Feb 2025
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Ziyan Jiang
Rui Meng
Xinyi Yang
Semih Yavuz
Yingbo Zhou
Wenhu Chen
MLLM
VLM
195
29
0
03 Jan 2025
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Xin Zhang
Yanzhao Zhang
Wen Xie
Mingxin Li
Ziqi Dai
Dingkun Long
Pengjun Xie
Meishan Zhang
Wenjie Li
Hao Fei
230
20
0
22 Dec 2024
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework
Yew Ken Chia
Liying Cheng
Hou Pong Chan
Chaoqun Liu
Maojia Song
Sharifah Mahani Aljunied
Soujanya Poria
Lidong Bing
RALM
VLM
106
6
0
09 Nov 2024
MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
Sheng-Chieh Lin
Chankyu Lee
Mohammad Shoeybi
Jimmy J. Lin
Bryan Catanzaro
Ming-Yu Liu
301
20
0
04 Nov 2024
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Wei He
Zhiheng Xi
Wanxu Zhao
Xiaoran Fan
Yiwen Ding
Zifei Shan
Tao Gui
Qi Zhang
Xuanjing Huang
LRM
162
8
0
24 Oct 2024
Unified Multi-Modal Interleaved Document Representation for Information Retrieval
Jaewoo Lee
Joonho Ko
Jinheon Baek
Soyeong Jeong
Sung Ju Hwang
106
2
0
03 Oct 2024
Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
Benjamin Clavié
Antoine Chaffin
Griffin Adams
48
4
0
23 Sep 2024
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Andreas Koukounas
Georgios Mastrapas
Michael Gunther
Bo Wang
Scott Martens
...
Saahil Ognawala
Susana Guzman
Maximilian Werk
Nan Wang
Han Xiao
VLM
102
18
0
30 May 2024
What matters when building vision-language models?
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
105
177
0
03 May 2024
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li
Yuqi Wang
Runxin Xu
Peiyi Wang
Xiachong Feng
Lingpeng Kong
Qi Liu
120
58
0
01 Mar 2024
Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism
Hippolyte Gisserot-Boukhlef
Manuel Faysse
Emmanuel Malherbe
Céline Hudelot
Pierre Colombo
173
4
0
20 Feb 2024
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Jianlv Chen
Shitao Xiao
Peitian Zhang
Kun Luo
Defu Lian
Zheng Liu
702
443
0
05 Feb 2024
Improving Text Embeddings with Large Language Models
Liang Wang
Nan Yang
Xiaolong Huang
Linjun Yang
Rangan Majumder
Furu Wei
SyDa
133
189
0
31 Dec 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
330
960
0
27 Nov 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen
Xiao Wang
Lucas Beyer
Alexander Kolesnikov
Jialin Wu
...
Keran Rong
Tianli Yu
Daniel Keysers
Xiao-Qi Zhai
Radu Soricut
MLLM
VLM
124
104
0
13 Oct 2023
Improved Baselines with Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM
MLLM
222
2,830
0
05 Oct 2023
Vision Transformers Need Registers
Timothée Darcet
Maxime Oquab
Julien Mairal
Piotr Bojanowski
ViT
196
356
0
28 Sep 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM
VLM
ObjD
189
945
0
24 Aug 2023
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao
LRM
122
1,336
0
17 Jul 2023
Large Language Models
Michael R Douglas
LLMAG
LM&MA
172
645
0
11 Jul 2023
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Ibrahim Alabdulmohsin
Xiaohua Zhai
Alexander Kolesnikov
Lucas Beyer
VLM
129
64
0
22 May 2023
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
579
4,942
0
17 Apr 2023
Rethinking the Role of Token Retrieval in Multi-Vector Retrieval
Jinhyuk Lee
Zhuyun Dai
Sai Meher Karthik Duddu
Tao Lei
Iftekhar Naim
Ming-Wei Chang
Vincent Zhao
110
18
0
04 Apr 2023
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai
Basil Mustafa
Alexander Kolesnikov
Lucas Beyer
CLIP
VLM
291
1,206
0
27 Mar 2023
Retrieving Multimodal Information for Augmented Generation: A Survey
Ruochen Zhao
Hailin Chen
Weishi Wang
Fangkai Jiao
Do Xuan Long
...
Bosheng Ding
Xiaobao Guo
Minzhi Li
Xingxuan Li
Shafiq Joty
126
88
0
20 Mar 2023
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Liang Wang
Nan Yang
Xiaolong Huang
Binxing Jiao
Linjun Yang
Daxin Jiang
Rangan Majumder
Furu Wei
VLM
259
624
0
07 Dec 2022
Unifying Vision, Text, and Layout for Universal Document Processing
Zineng Tang
Ziyi Yang
Guoxin Wang
Yuwei Fang
Yang Liu
Chenguang Zhu
Michael Zeng
Chao-Yue Zhang
Joey Tianyi Zhou
VLM
131
115
0
05 Dec 2022
Towards Complex Document Understanding By Discrete Reasoning
Fengbin Zhu
Wenqiang Lei
Fuli Feng
Chao Wang
Haozhou Zhang
Tat-Seng Chua
109
48
0
25 Jul 2022
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset
Ashish V. Thapliyal
Jordi Pont-Tuset
Xi Chen
Radu Soricut
VGen
175
78
0
25 May 2022
PLAID: An Efficient Engine for Late Interaction Retrieval
Keshav Santhanam
Omar Khattab
Christopher Potts
Matei A. Zaharia
VLM
119
76
0
19 May 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
420
3,617
0
29 Apr 2022