Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
Text-to-3D Generation using Jensen-Shannon Score Distillation
Khoi Do
Binh-Son Hua
DiffM
90
0
0
08 Mar 2025
SplatTalk: 3D VQA with Gaussian Splatting
Anh Thai
Songyou Peng
Kyle Genova
Leonidas Guibas
Thomas Funkhouser
3DGS
147
1
0
08 Mar 2025
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Junyan Lin
Haoran Chen
Yue Fan
Yingqi Fan
Xin Jin
Hui Su
Jinlan Fu
Xiaoyu Shen
101
0
0
08 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yue Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
403
1
0
08 Mar 2025
Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models
Messi H.J. Lee
Soyeon Jeon
Jacob M. Montgomery
Calvin K. Lai
VLM
CoGe
86
0
0
07 Mar 2025
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
Chan hur
Jeong-hun Hong
Dong-hun Lee
Dabin Kang
Semin Myeong
Sang-hyo Park
Hyeyoung Park
196
1
0
07 Mar 2025
Multi-modal Summarization in Model-Based Engineering: Automotive Software Development Case Study
Nenad Petrovic
Yurui Zhang
Moaad Maaroufi
Kuo-Yi Chao
Lukasz Mazur
Fengjunjie Pan
Vahid Zolfaghari
Alois C. Knoll
120
0
0
06 Mar 2025
GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding
Xihan Wang
Dianyi Yang
Yu Gao
Yufeng Yue
Yi Yang
M. Fu
3DGS
83
0
0
06 Mar 2025
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation
Aishik Konwer
Zhijian Yang
Erhan Bas
Cao Xiao
Prateek Prasanna
Parminder Bhatia
Taha A. Kass-Hout
MedIm
VLM
114
1
0
06 Mar 2025
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Haiyang Yu
Jinghui Lu
Yanjie Wang
Yang Li
Han Wang
Can Huang
B. Li
VLM
114
4
0
06 Mar 2025
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model
Wenke Huang
Jian Liang
Xianda Guo
Yiyang Fang
Guancheng Wan
...
Bin Yang
He Li
Jiawei Shao
Mang Ye
Di Lin
OffRL
LRM
MLLM
KELM
VLM
161
4
0
06 Mar 2025
Underlying Semantic Diffusion for Effective and Efficient In-Context Learning
Zhong Ji
Weilong Cao
Yan Zhang
Yanwei Pang
Jungong Han
Xuelong Li
DiffM
VLM
88
0
0
06 Mar 2025
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
Rui Zhao
Weijia Mao
Mike Zheng Shou
107
1
0
05 Mar 2025
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Huang Huang
Fangchen Liu
Letian Fu
Tingfan Wu
Mustafa Mukadam
Jitendra Malik
Ken Goldberg
Pieter Abbeel
LM&Ro
VLM
184
10
0
05 Mar 2025
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
Wei Li
Bing Hu
Rui Shao
Leyang Shen
Liqiang Nie
104
4
0
05 Mar 2025
BEVDriver: Leveraging BEV Maps in LLMs for Robust Closed-Loop Driving
Katharina Winter
Mark Azer
Fabian B. Flohr
121
2
0
05 Mar 2025
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang
Jinyeong Kim
Junhyeok Kim
Seong Jae Hwang
VLM
166
10
0
05 Mar 2025
SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang
Ziming Cheng
Junting Pan
Zhaohui Hou
Mingjie Zhan
LLMAG
168
4
0
05 Mar 2025
Task-Agnostic Attacks Against Vision Foundation Models
Brian Pulfer
Yury Belousov
Vitaliy Kinakh
Teddy Furon
S. Voloshynovskiy
AAML
111
0
0
05 Mar 2025
Are Large Vision Language Models Good Game Players?
Xinyu Wang
Bohan Zhuang
Qi Wu
MLLM
ELM
LRM
155
8
0
04 Mar 2025
MindSimulator: Exploring Brain Concept Localization via Synthetic FMRI
Guangyin Bao
Qi Zhang
Z. Gong
Zhuojia Wu
Duoqian Miao
104
1
0
04 Mar 2025
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments
Ege Özsoy
Chantal Pellegrini
Tobias Czempiel
Felix Tristram
Kun Yuan
David Bani-Harouni
U. Eck
Benjamin Busam
Matthias Keicher
Nassir Navab
125
4
0
04 Mar 2025
WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation
Dujun Nie
Xianda Guo
Yiqun Duan
Ruijun Zhang
Long Chen
LM&Ro
354
5
0
04 Mar 2025
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Haoxin Li
Boyang Li
CoGe
188
1
0
03 Mar 2025
Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA
Z. Zhong
Yuli Wang
Lulu Bi
Zhuoqi Ma
S. H. Ahn
...
Webster Stayman
Todd M. Kolb
I. Kamel
Harrison X. Bai
Zhicheng Jiao
LM&MA
93
0
0
03 Mar 2025
Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text
Guotao Liang
Baoquan Zhang
Zhiyuan Wen
Junteng Zhao
Yunming Ye
Kola Ye
Yao He
96
0
0
03 Mar 2025
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Zitang Zhou
Ke Mei
Yu Lu
Tianyi Wang
Fengyun Rao
134
2
0
03 Mar 2025
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
Hao Tang
Chenwei Xie
Haiyang Wang
Xiaoyi Bao
Tingyu Weng
Pandeng Li
Yun Zheng
Liwei Wang
ObjD
VLM
134
1
0
03 Mar 2025
OVAMOS: A Framework for Open-Vocabulary Multi-Object Search in Unknown Environments
Qianwei Wang
Yifan Xu
V. Kamat
Carol Menassa
72
0
0
03 Mar 2025
Advancing vision-language models in front-end development via data synthesis
Tong Ge
Yashu Liu
Jieping Ye
Tianyi Li
Chao Wang
104
0
0
03 Mar 2025
Dementia Insights: A Context-Based MultiModal Approach
Sahar Sinene Mehdoui
Abdelhamid Bouzid
Daniel Sierra-Sosa
Adel Elmaghraby
105
0
0
03 Mar 2025
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
Zhipeng Huang
Shaobin Zhuang
Canmiao Fu
Binxin Yang
Ying Zhang
Chong Sun
Zhizheng Zhang
Yali Wang
Chen Li
Zheng-Jun Zha
DiffM
123
3
0
03 Mar 2025
Enhancing Retinal Vessel Segmentation Generalization via Layout-Aware Generative Modelling
Jonathan Fhima
Jan Van Eijgen
Lennert Beeckmans
Thomas Jacobs
Moti Freiman
Luis Filipe Nakayama
Ingeborg Stalmans
Chaim Baskin
Joachim A. Behar
MedIm
178
0
0
03 Mar 2025
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living
Ramanathan Rajendiran
Debaditya Roy
Basura Fernando
VGen
124
0
0
03 Mar 2025
ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization
Shizhan Liu
Hao Zheng
Hang Yu
Jianguo Li
DiffM
112
0
0
03 Mar 2025
Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think
Jie Tian
Xiaoye Qu
Zhenyi Lu
Xiaoye Qu
Sichen Liu
Yu Cheng
DiffM
VGen
81
4
0
02 Mar 2025
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
Ziyang Zhang
Yang Yu
Yucheng Chen
Xulei Yang
S. Yeo
MedIm
179
2
0
02 Mar 2025
HalCECE: A Framework for Explainable Hallucination Detection through Conceptual Counterfactuals in Image Captioning
Maria Lymperaiou
Giorgos Filandrianos
Angeliki Dimitriou
Athanasios Voulodimos
Giorgos Stamou
MLLM
56
0
0
01 Mar 2025
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di
Zhelun Yu
Guanghao Zhang
Haoyuan Li
Tao Zhong
Hao Cheng
Bolin Li
Wanggui He
Fangxun Shu
Hao Jiang
116
9
0
01 Mar 2025
Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning
Hanxun Yu
Wentong Li
Song Wang
Jintai Chen
Jianke Zhu
3DV
LRM
159
6
0
01 Mar 2025
Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems
Song Xia
Yi Yu
Wenhan Yang
Meiwen Ding
Zhuo Chen
Lingyu Duan
Alex C. Kot
Xudong Jiang
116
4
0
01 Mar 2025
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang
Jihao Qiu
Lingxi Xie
Yunjie Tian
Jianbin Jiao
Qixiang Ye
120
5
0
28 Feb 2025
SafeText: Safe Text-to-image Models via Aligning the Text Encoder
Yuepeng Hu
Zhengyuan Jiang
Neil Zhenqiang Gong
101
5
0
28 Feb 2025
RTGen: Real-Time Generative Detection Transformer
Chi Ruan
ObjD
VLM
80
0
0
28 Feb 2025
SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models
J.N. Zhang
Xuan Yang
Tianfu Wang
Yu Yao
Aleksandr Petiushko
B. Li
131
0
0
28 Feb 2025
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Yuheng Ji
Huajie Tan
Jiayu Shi
Xiaoshuai Hao
Yuan Zhang
...
Huaihai Lyu
Xiaolong Zheng
Jiaming Liu
Zhongyuan Wang
Shanghang Zhang
187
15
0
28 Feb 2025
Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack
Chenhe Gu
Jindong Gu
Andong Hua
Yao Qin
AAML
88
0
0
27 Feb 2025
Data Distributional Properties As Inductive Bias for Systematic Generalization
Felipe del-Rio
Alain Raymond-Sáez
Daniel Florea
Rodrigo Toro Icarte
Julio Hurtado
Cristian B. Calderon
Á. Soto
AI4CE
115
1
0
27 Feb 2025
Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation
Kang Liu
Zhuoqi Ma
Xiaolu Kang
Yunan Li
Kun Xie
Zhicheng Jiao
Qiguang Miao
89
4
0
27 Feb 2025
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
Zhongyang Li
Ziyue Li
Dinesh Manocha
MoE
144
0
0
27 Feb 2025
Previous
1
2
3
...
13
14
15
...
45
46
47
Next