Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,340 papers shown
Title
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Wentao Ma
Weiming Ren
Yiming Jia
Zhuofeng Li
Ping Nie
Ge Zhang
Wenhu Chen
75
1
0
20 May 2025
Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Chun-Yi Kuan
Hung-yi Lee
88
1
0
20 May 2025
TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks
Yuanze Hu
Zhaoxin Fan
Xinyu Wang
Gen Li
Ye Qiu
...
Wenjun Wu
Kejian Wu
Yifan Sun
Xiaotie Deng
Jin Song Dong
62
0
0
19 May 2025
VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection
Aditya Taparia
Noel Ngu
Mario Leiva
Joshua Shay Kricheli
John Corcoran
Nathaniel D. Bastian
Gerardo Simari
Paulo Shakarian
Ransalu Senanayake
ObjD
82
0
0
19 May 2025
BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation
Haiquan Wen
Yiwei He
Zhenglin Huang
Tianxiao Li
Zihan Yu
Xingru Huang
Lu Qi
Baoyuan Wu
Xuelong Li
Guangliang Cheng
VGen
109
0
0
19 May 2025
FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks
Zihua Wang
Ruibo Li
Haozhe Du
Joey Tianyi Zhou
Yu Zhang
Xu Yang
MLLM
131
0
0
19 May 2025
GeoVLM: Improving Automated Vehicle Geolocalisation Using Vision-Language Matching
Barkin Dagda
Muhammad Awais
Saber Fallah
101
0
0
19 May 2025
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models
Matteo Merler
Nicola Dainese
Minttu Alakuijala
Giovanni Bonetta
Pietro Ferrazzi
Yu Tian
Bernardo Magnini
Pekka Marttinen
LM&Ro
VLM
117
0
0
19 May 2025
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
Thong Nguyen
Zhiyuan Hu
Xu Lin
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
86
0
0
19 May 2025
From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection
Lincan Cai
Jingxuan Kang
Shuang Li
Wenxuan Ma
Binhui Xie
Zhida Qin
Jian Liang
VLM
88
0
0
19 May 2025
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
Lihong Chen
Hossein Hassani
Soodeh Nikan
VLM
104
0
0
19 May 2025
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Yang Liu
Ming Ma
Xiaomin Yu
Pengxiang Ding
Han Zhao
Mingyang Sun
Siteng Huang
Donglin Wang
LRM
199
0
0
18 May 2025
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning
Bonan li
Zicheng Zhang
Songhua Liu
Weihao Yu
Xinchao Wang
VLM
142
0
0
17 May 2025
IQBench: How "Smart'' Are Vision-Language Models? A Study with Human IQ Tests
Tan-Hanh Pham
Phu-Vinh Nguyen
Dang The Hung
Bui Trong Duong
Vu Nguyen Thanh
Chris Ngo
Tri Quang Truong
Truong-Son Hy
ReLM
CoGe
VLM
LRM
64
0
0
17 May 2025
UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings
Jiajun Qin
Yuan Pu
Zhuolun He
Seunggeun Kim
David Z. Pan
Bei Yu
106
0
0
17 May 2025
Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions
Wei Zhao
Gongsheng Li
Zhefei Gong
Pengxiang Ding
Han Zhao
Donglin Wang
LM&Ro
78
0
0
16 May 2025
Geofenced Unmanned Aerial Robotic Defender for Deer Detection and Deterrence (GUARD)
Ebasa Temesgen
Mario Jerez
Greta Brown
Graham Wilson
Sree Ganesh Lalitaditya Divakarla
Sarah Boelter
Oscar Nelson
Robert McPherson
Maria Gini
73
0
0
16 May 2025
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
Yansheng Qiu
Li Xiao
Zhaopan Xu
Pengfei Zhou
Zheng Wang
Kai Zhang
ELM
LRM
144
0
0
16 May 2025
Breaking the Batch Barrier (B3) of Contrastive Learning via Smart Batch Mining
Raghuveer Thirukovalluru
Rui Meng
Yang Liu
Karthikeyan K
Mingyi Su
Ping Nie
Semih Yavuz
Yingbo Zhou
Wenhu Chen
Bhuwan Dhingra
85
1
0
16 May 2025
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Yang Liu
Shengfang Zhai
Mingzhe Du
Yulin Chen
Tri Cao
...
Xuzhao Li
Kun Wang
Junfeng Fang
Jiaheng Zhang
Bryan Hooi
OffRL
LRM
107
3
0
16 May 2025
Multimodal Event Detection: Current Approaches and Defining the New Playground through LLMs and VLMs
Abhishek Dey
Aabha Bothera
Samhita Sarikonda
Rishav Aryan
Sanjay Kumar Podishetty
Akshay Havalgi
Gaurav Singh
Saurabh Srivastava
79
0
0
16 May 2025
Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding
Jianhao Huang
Qunsong Zeng
Kaibin Huang
DiffM
87
0
0
15 May 2025
StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation
Daniel A. P. Oliveira
David Martins de Matos
VGen
71
0
0
15 May 2025
AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges
Ranjan Sapkota
Konstantinos I. Roumeliotis
Manoj Karkee
AI4TS
143
6
0
15 May 2025
Variational Visual Question Answering
Tobias Jan Wieczorek
Nathalie Daun
Mohammad Emtiyaz Khan
Marcus Rohrbach
OOD
92
0
0
14 May 2025
A Multimodal Multi-Agent Framework for Radiology Report Generation
Ziruo Yi
Ting Xiao
Mark V. Albert
MedIm
58
0
0
14 May 2025
Bias and Generalizability of Foundation Models across Datasets in Breast Mammography
Elodie Germani
Selin Türk Ilayda
Zeineddine Fatima
Mourad Charbel
Shadi Albarqouni
AI4CE
115
0
0
14 May 2025
Controllable Image Colorization with Instance-aware Texts and Masks
Yanru An
Ling Gui
Qiang Hu
Chunlei Cai
Tianxiao Ye
Xiaoyun Zhang
Yanfeng Wang
DiffM
56
0
0
13 May 2025
Leveraging Multi-Modal Information to Enhance Dataset Distillation
Zhe Li
Hadrien Reynaud
Bernhard Kainz
DD
101
0
0
13 May 2025
FauForensics: Boosting Audio-Visual Deepfake Detection with Facial Action Units
Jian Wang
Baoyuan Wu
Li Liu
Qingshan Liu
AAML
85
0
0
13 May 2025
CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding
Wenxuan Ma
Xiaoge Cao
Yize Zhang
Chaofan Zhang
Shaobo Yang
Peng Hao
Bin Fang
Yinghao Cai
Shaowei Cui
Shuo Wang
123
0
0
13 May 2025
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
Zhaochen Su
Linjie Li
Mingyang Song
Yunzhuo Hao
Zhengyuan Yang
...
Guanjie Chen
Jiawei Gu
Juntao Li
Xiaoye Qu
Yu Cheng
OffRL
LRM
84
11
0
13 May 2025
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving
Zongchuang Zhao
Haoyu Fu
Dingkang Liang
Xin Zhou
Dingyuan Zhang
Hongwei Xie
Bing Wang
Xiang Bai
MLLM
VLM
125
0
0
13 May 2025
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Donghoon Kim
Minji Bae
Kyuhong Shim
B. Shim
75
1
0
13 May 2025
Behind Maya: Building a Multilingual Vision Language Model
Nahid Alam
Karthik Reddy Kanjula
Surya Guthikonda
Timothy Chung
Bala Krishna S Vegesna
...
Isha Chaturvedi
Genta Indra Winata
Ashvanth.S
Snehanshu Mukherjee
Alham Fikri Aji
MLLM
VLM
78
0
0
13 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Hao Peng
Tong Zhang
Heng Ji
VLM
79
0
0
13 May 2025
Visually Interpretable Subtask Reasoning for Visual Question Answering
Yu Cheng
A. Goel
Hakan Bilen
LRM
68
0
0
12 May 2025
Visual Instruction Tuning with Chain of Region-of-Interest
Yixin Chen
Shuai Zhang
Boran Han
Bernie Wang
82
0
0
11 May 2025
DriveSOTIF: Advancing Perception SOTIF Through Multimodal Large Language Models
Shucheng Huang
Freda Shi
Chen Sun
Jiaming Zhong
Minghao Ning
Yufeng Yang
Yukun Lu
Hong Wang
A. Khajepour
95
0
0
11 May 2025
METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection
Yongqi Wang
Xinxiao Wu
Shuo Yang
ObjD
71
0
0
10 May 2025
LLM-Land: Large Language Models for Context-Aware Drone Landing
Siwei Cai
Yuwei Wu
Lifeng Zhou
68
0
0
09 May 2025
Describe Anything in Medical Images
Xi Xiao
Yunbei Zhang
Thanh-Huy Nguyen
Ba Thinh Lam
Janet Wang
...
Xiaobei Wang
Xiao Wang
Hao Xu
Tianming Liu
Min Xu
MedIm
VLM
189
0
0
09 May 2025
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang
Bo Feng
Zhengfeng Lai
Mingze Xu
Shiyu Li
Weifeng Ge
Afshin Dehghan
Meng Cao
Ping Huang
OffRL
150
0
0
08 May 2025
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Han Xiao
Yina Xie
Guanxin Tan
Yinghao Chen
R. Hu
...
Peng Gao
Yafei Wen
Xiaoxin Chen
Shuai Ren
Hongsheng Li
VLM
81
1
0
08 May 2025
PADriver: Towards Personalized Autonomous Driving
Genghua Kou
Fan Jia
Weixin Mao
Yang Liu
Yucheng Zhao
Ziheng Zhang
Osamu Yoshie
Tiancai Wang
You Li
Xinming Zhang
107
0
0
08 May 2025
PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
Ahmed Abdelreheem
Filippo Aleotti
Jamie Watson
Z. Qureshi
Abdelrahman Eldesokey
Peter Wonka
Gabriel J. Brostow
Sara Vicente
Guillermo Garcia-Hernando
DiffM
141
0
0
08 May 2025
Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction
Li Yuan
Yi Cai
Xudong Shen
Qing Li
Qingbao Huang
Zikun Deng
Tao Wang
MoMe
OffRL
MoE
91
0
0
08 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
157
1
0
08 May 2025
Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos
Giulio Cesare Mastrocinque Santo
Patrícia Izar
Irene Delval
Victor de Napole Gregolin
Nina S. T. Hirata
VGen
78
0
0
08 May 2025
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie
Bin Wang
Fanjing Kong
Jincheng Li
Dawei Liang
Gengshen Zhang
Dawei Leng
Yuhui Yin
CLIP
VLM
184
1
0
08 May 2025
Previous
1
2
3
4
5
6
...
45
46
47
Next