2301.12597
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,338 papers shown
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Ziyu Yao
Xuxin Cheng
Zhiqi Huang
Lei Li
159
2
0
01 Jul 2025
A Narrative Review on Large AI Models in Lung Cancer Screening, Diagnosis, and Treatment Planning
Jiachen Zhong
Yiting Wang
Di Zhu
Ziwei Wang
LM&MA
AI4CE
48
1
0
01 Jul 2025
ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts
Shiu-hong Kao
Yu-Wing Tai
Chi-Keung Tang
VOS
MLLM
VGen
LRM
105
0
0
01 Jul 2025
Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
Shiu-hong Kao
Yu-Wing Tai
Chi-Keung Tang
MLLM
LRM
283
1
0
01 Jul 2025
DreamCube: 3D Panorama Generation via Multi-plane Synchronization
Yukun Huang
Yanning Zhou
Jianan Wang
Kaiyi Huang
Xihui Liu
18
0
0
20 Jun 2025
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
Lei Jiang
Zixun Zhang
Zizhou Wang
Xiaobing Sun
Zhen Li
Liangli Zhen
Xiaohua Xu
AAML
17
0
0
20 Jun 2025
How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
Manuel Brack
Sudeep Katakol
Felix Friedrich
P. Schramowski
Hareesh Ravi
Kristian Kersting
Ajinkya Kale
20
0
0
20 Jun 2025
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM
VLM
16
0
0
20 Jun 2025
FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
Fan Yang
Yousong Zhu
Xin Li
Yufei Zhan
Hongyin Zhao
Shurong Zheng
Yaowei Wang
Ming Tang
Jinqiao Wang
MLLM
VLM
40
0
0
20 Jun 2025
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
Haoran Sun
Yankai Jiang
Wenjie Lou
Yujie Zhang
Wenjie Li
Lilong Wang
Mianxin Liu
Lei Liu
Xiaosong Wang
LRM
15
0
0
20 Jun 2025
AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models
Yuan Zhang
Chun-Kai Fan
Tao Huang
Ming Lu
Sicheng Yu
Junwen Pan
Kuan Cheng
Qi She
Shanghang Zhang
VLM
LRM
19
0
0
19 Jun 2025
MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models
Xingbai Chen
Tingchao Fu
Renyang Liu
Wei Zhou
Chao Yi
AAML
26
0
0
19 Jun 2025
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Changli Tang
Yixuan Li
Yudong Yang
Jimin Zhuang
Guangzhi Sun
Wei Li
Zejun Ma
Chao Zhang
23
0
0
18 Jun 2025
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma
Yiqiao Jin
Vineeth Rakesh
Yingtong Dou
Menghai Pan
Mahashweta Das
Srijan Kumar
AAML
18
0
0
18 Jun 2025
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee
Ryo Hachiuma
Yong Man Ro
Yu-Chun Wang
Yueh-Hua Wu
VLM
38
0
0
18 Jun 2025
Demystifying the Visual Quality Paradox in Multimodal Large Language Models
Shuo Xing
Lanqing Guo
Hongyuan Hua
Seoyoung Lee
Peiran Li
Yufei Wang
Zhangyang Wang
Zhengzhong Tu
VLM
41
0
0
18 Jun 2025
Weakly-supervised VLM-guided Partial Contrastive Learning for Visual Language Navigation
Ruoyu Wang
Tong Yu
Junda Wu
Yao Liu
Julian McAuley
Lina Yao
15
0
0
18 Jun 2025
Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models
Xuelin Shen
Jiayin Xu
Kangsheng Yin
Wenhan Yang
AAML
19
0
0
18 Jun 2025
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria
Adinath Madhavrao Dukre
Feilong Tang
Sara Atito
Sudipta Roy
Muhammad Awais
Muhammad Haris Khan
Imran Razzak
VLM
42
0
0
18 Jun 2025
Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition
Jiamin Xie
Ju Lin
Yiteng Huang
Tyler Vuong
Zhaojiang Lin
...
Peng Su
Prashant Rawat
Sangeeta Srivastava
Ming Sun
Florian Metze
17
0
0
17 Jun 2025
NetRoller: Interfacing General and Specialized Models for End-to-End Autonomous Driving
Ren Xin
Hongji Liu
Xiaodong Mei
Wenru Liu
Maosheng Ye
Zhili Chen
Jun Ma
27
0
0
17 Jun 2025
Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems
Tuan Nguyen
Long-Vu Hoang
Huy-Dat Tran
12
0
0
16 Jun 2025
Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation
Utkarsh Bajpai
Julius Ruckin
Cyrill Stachniss
Marija Popović
15
0
0
16 Jun 2025
Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Shaolei Zhang
Shoutao Guo
Qingkai Fang
Yan Zhou
Yang Feng
MLLM
AuLLM
VLM
51
0
0
16 Jun 2025
Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments
Xuan Wang
Siyuan Liang
Zhe Liu
Yi Yu
Yuliang Lu
Xiaochun Cao
Ee-Chien Chang
X. Gao
AAML
70
0
0
16 Jun 2025
Anomaly Object Segmentation with Vision-Language Models for Steel Scrap Recycling
Daichi Tanaka
Takumi Karasawa
Shu Takenouchi
Rei Kawakami
18
0
0
16 Jun 2025
Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs
Gyutaek Oh
Seoyeon Kim
Sangjoon Park
Byung-Hoon Kim
LM&MA
LRM
31
0
0
16 Jun 2025
SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models
Xinyi Zhao
Congjing Zhang
Pei Guo
Wei Li
Lin Chen
Chaoyue Zhao
Shuai Huang
20
0
0
15 Jun 2025
Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency
Hiroshi Tanaka
Anika Rao
Hana Satou
Michael Johnson
Sofia García
18
0
0
15 Jun 2025
The Safety Reminder: A Soft Prompt to Reactivate Delayed Safety Awareness in Vision-Language Models
Peiyuan Tang
Haojie Xin
Xiaodong Zhang
Jun Sun
Qin Xia
Zijiang Yang
VLM
19
0
0
15 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
30
0
0
13 Jun 2025
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
Bo-Cheng Chiu
Jen-Jee Chen
Yu-Chee Tseng
Feng-Chi Chen
14
0
0
13 Jun 2025
Dynamic Double Space Tower
Weikai Sun
Shijie Song
Han Wang
15
0
0
13 Jun 2025
Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
Yuan Gao
Mattia Piccinini
Yuchen Zhang
Dingrui Wang
Korbinian Moller
...
Steven Peters
Andrea Stocco
Bassam Alrifaee
Marco Pavone
Johannes Betz
23
0
0
13 Jun 2025
Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
Fei Lin
Ziyang Gong
Cong Wang
Yonglin Tian
Tengchao Zhang
Xue Yang
Gen Luo
Fei Wang
124
0
0
12 Jun 2025
Can Sound Replace Vision in LLaVA With Token Substitution?
Ali Vosoughi
Jing Bi
Pinxin Liu
Yunlong Tang
Chenliang Xu
CLIP
VLM
131
0
0
12 Jun 2025
Improving Medical Visual Representation Learning with Pathological-level Cross-Modal Alignment and Correlation Exploration
Jun Wang
Lixing Zhu
Xiaohan Yu
A. Bhalerao
Yulan He
122
0
0
12 Jun 2025
LLMs Are Not Yet Ready for Deepfake Image Detection
Shahroz Tariq
David D. Nguyen
M.A.P. Chamikara
Tingmin Wu
A. Abuadbba
Kristen Moore
VLM
102
0
0
12 Jun 2025
Uncertainty-Aware Deep Learning for Automated Skin Cancer Classification: A Comprehensive Evaluation
Hamzeh Asgharnezhad
Pegah Tabarisaadi
Abbas Khosravi
R. Alizadehsani
Usha R. Acharya
122
0
0
12 Jun 2025
MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
Liang Yin
Xudong Xie
Zhang Li
Xiang Bai
Yuliang Liu
LRM
117
0
0
12 Jun 2025
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Zhiyang Xu
Jiuhai Chen
Zhaojiang Lin
Xichen Pan
Lifu Huang
...
Di Jin
Michihiro Yasunaga
Lili Yu
Xi Lin
Shaoliang Nie
121
1
0
12 Jun 2025
LLM-to-Phy3D: Physically Conform Online 3D Object Generation with LLMs
Melvin Wong
Yueming Lyu
Thiago Rios
Stefan Menzel
Yew-Soon Ong
PINN
AI4CE
38
0
0
11 Jun 2025
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning
C. L. Philip Chen
Yunpeng Zhai
Yifan Zhao
Jinyang Gao
Bolin Ding
Jia Li
41
0
0
11 Jun 2025
A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation
Yukang Feng
Jianwen Sun
Chuanhao Li
Zizhen Li
Jiaxin Ai
...
Yifan Chang
Sizhuo Zhou
Shenglin Zhang
Yu Dai
Kaipeng Zhang
MLLM
EGVM
90
0
0
11 Jun 2025
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs
Beomsik Cho
Jaehyung Kim
64
0
0
11 Jun 2025
Vision Generalist Model: A Survey
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
65
0
0
11 Jun 2025
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning
Yuting Li
Lai Wei
Kaipeng Zheng
Jingyuan Huang
Linghe Kong
Lichao Sun
Weiran Huang
AAML
LRM
VLM
80
0
0
11 Jun 2025
HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding
Yanzhao Shi
Xiaodan Zhang
Junzhong Ji
Haoning Jiang
Chengxin Zheng
Y. Wang
Liangqiong Qu
89
0
0
11 Jun 2025
Multimodal Representation Alignment for Cross-modal Information Retrieval
Fan Xu
Luis A. Leiva
19
0
0
10 Jun 2025
Bias Analysis in Unconditional Image Generative Models
Xiaofeng Zhang
Michelle Lin
Simon Lacoste-Julien
Aaron Courville
Yash Goyal
23
0
0
10 Jun 2025