2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,338 papers shown
Title
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng
Ziyuan Huang
Kaixiang Ji
Yichao Yan
VLM
240
1
0
26 Mar 2025
Vision as LoRA
Han Wang
Yongjie Ye
Bingru Li
Yuxiang Nie
Jinghui Lu
Jingqun Tang
Yanjie Wang
Can Huang
140
2
0
26 Mar 2025
Dynamic Pyramid Network for Efficient Multimodal Large Language Model
Hao Ai
Kunyi Wang
Zezhou Wang
H. Lu
Jin Tian
Yaxin Luo
Peng-Fei Xing
Jen-Yuan Huang
Huaxia Li
Gen Luo
MLLM
VLM
173
0
0
26 Mar 2025
Qwen2.5-Omni Technical Report
Jin Xu
Zhifang Guo
Jinzheng He
Hangrui Hu
Ting He
...
K. Dang
Bin Zhang
Xinyu Wang
Yunfei Chu
Junyang Lin
VGen
AuLLM
164
55
0
26 Mar 2025
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
Zitian Wang
Yue Liao
Kang Rong
Fengyun Rao
Yibo Yang
Si Liu
115
0
0
26 Mar 2025
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jiayi Ji
Jie Lou
Debing Zhang
Rongrong Ji
219
2
0
26 Mar 2025
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines
Chen Tang
Xinzhu Ma
Encheng Su
Xiufeng Song
Xiaohong Liu
Wei-Hong Li
Lei Bai
Wanli Ouyang
Xiangyu Yue
3DGS
AI4TS
102
0
0
26 Mar 2025
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
Xiao Guo
Xiufeng Song
Yue Zhang
Xiaohong Liu
Xuyang Liu
143
1
0
26 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
154
0
0
26 Mar 2025
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou
Cesar Borja
Ruben Martinez-Cantin
Ana C. Murillo
103
0
0
25 Mar 2025
LangBridge: Interpreting Image as a Combination of Language Embeddings
Jiaqi Liao
Yuwei Niu
Fanqing Meng
Hao Li
Changyao Tian
...
Dianqi Li
X. Zhu
Li Yuan
Jifeng Dai
Yu Cheng
MLLM
150
1
0
25 Mar 2025
Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications
Ben Rahman
VLM
84
3
0
25 Mar 2025
LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text
Weizhi Chen
Jingbo Chen
Yupeng Deng
Jiansheng Chen
Yuman Feng
Zhihao Xi
Diyou Liu
Kai Li
Yu Meng
VLM
102
1
0
25 Mar 2025
Improved Alignment of Modalities in Large Vision Language Models
Kartik Jangra
Aman Kumar Singh
Yashwani Mann
Geetanjali Rathee
VLM
84
0
0
25 Mar 2025
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation
Zhiyao Ren
Yibing Zhan
B. Yu
Dacheng Tao
DiffM
95
0
0
25 Mar 2025
ImageSet2Text: Describing Sets of Images through Text
Piera Riccio
F. Galati
Kajetan Schweighofer
Noa Garcia
Nuria Oliver
VLM
CoGe
117
0
0
25 Mar 2025
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
Fucai Ke
Vijay Kumar B G
Xingjian Leng
Zhixi Cai
Zaid Khan
Weiqing Wang
P. D. Haghighi
H. Rezatofighi
Manmohan Chandraker
159
1
0
25 Mar 2025
PAVE: Patching and Adapting Video Large Language Models
Zhuoming Liu
Yiquan Li
Khoi Duc Nguyen
Yiwu Zhong
Yin Li
KELM
LRM
134
1
0
25 Mar 2025
Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation
Niccolo Avogaro
Thomas Frick
Mattia Rigotti
Andrea Bartezzaghi
Filip M. Janicki
Cristiano Malossi
Konrad Schindler
Roy Assaf
MLLM
VLM
104
1
0
25 Mar 2025
FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model
Zhiqiang Zhang
Jia-Nan Li
Zunnan Xu
Hanhui Li
Yiji Cheng
Fa-Ting Hong
Qin Lin
Qinglin Lu
Xiaodan Liang
DiffM
140
2
0
25 Mar 2025
DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data
Liming Zheng
Feng Yan
Fanfan Liu
C. Feng
Yufeng Zhong
Yiyang Huang
Lin Ma
106
0
0
25 Mar 2025
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
Zhi Hou
Tianyi Zhang
Yuwen Xiong
Haonan Duan
Hengjun Pu
...
Chengyang Zhao
X. Zhu
Yu Qiao
Jifeng Dai
Yuxiao Chen
139
6
0
25 Mar 2025
MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
Zhenyu Pan
Han Liu
OffRL
LRM
139
7
0
24 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
213
1
0
24 Mar 2025
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
...
Jinghua Yan
Y. Bai
P. Sadayappan
Helen Zhou
Bo Yuan
VLM
155
2
0
24 Mar 2025
Compositional Caching for Training-free Open-vocabulary Attribute Detection
Marco Garosi
Alessandro Conti
Gaowen Liu
Elisa Ricci
Massimiliano Mancini
ObjD
VLM
103
0
0
24 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zhengyang Liang
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
274
9
0
24 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Yue Yang
Afshin Dehghan
153
5
0
24 Mar 2025
VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
Wencheng Zhu
Yuexin Wang
Hongxuan Li
Pengfei Zhu
Q. Hu
CLIP
111
0
0
24 Mar 2025
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
Wenhao You
Bryan Hooi
Yiwei Wang
Yansen Wang
Zong Ke
Ming Yang
Zi Huang
Yujun Cai
AAML
100
0
0
24 Mar 2025
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Jinjin Zhang
Guodong Wang
Yizhou Jin
Di Huang
90
2
0
24 Mar 2025
Context-Enhanced Memory-Refined Transformer for Online Action Detection
Zhanzhong Pang
Fadime Sener
Angela Yao
OffRL
125
2
0
24 Mar 2025
On the Perception Bottleneck of VLMs for Chart Understanding
Junteng Liu
Weihao Zeng
Xiwen Zhang
Yijun Wang
Zifei Shan
Junxian He
102
0
0
24 Mar 2025
Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models
Jinho Jeong
Sangmin Han
Jinwoo Kim
Seon Joo Kim
72
1
0
24 Mar 2025
Panorama Generation From NFoV Image Done Right
Dian Zheng
Cheng Zhang
Xiao-Ming Wu
Cao Li
Chengfei Lv
Jian-Fang Hu
Wei-Shi Zheng
DiffM
132
2
0
24 Mar 2025
Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models
Qiao Liang
Yanjiang Liu
Xianpei Han
Yaojie Lu
Hongyu Lin
Jia Zheng
Le Sun
Yingfei Sun
97
0
0
23 Mar 2025
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
Zefeng Zhang
Hengzhu Tang
Shuaiyi Nie
Zhenyu Zhang
Yiming Ren
Zhenyang Li
Dawei Yin
Duohe Ma
Tingwen Liu
117
1
0
23 Mar 2025
FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation
Dong Zhao
Jinlong Li
Shuang Wang
Mengyao Wu
Qi Zang
N. Sebe
Zhun Zhong
461
1
0
23 Mar 2025
MAO: Efficient Model-Agnostic Optimization of Prompt Tuning for Vision-Language Models
Haoyang Li
Siyu Zhou
Liang Wang
Guodong Long
VLM
117
0
0
23 Mar 2025
Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes
Kelly O. Marshall
Omid Poursaeed
Sergiu Oprea
Amit Kumar
Anushrut Jignasu
Chinmay Hegde
Yilei Li
Rakesh Ranjan
3DV
104
0
0
23 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
104
1
0
22 Mar 2025
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
Yuchen Sun
Shanhui Zhao
Tao Yu
Hao Wen
Samith Va
Mengwei Xu
Yan Liang
Chongyang Zhang
LLMAG
124
3
0
22 Mar 2025
Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
Shulei Wang
Wang Lin
Hai Huang
Hanting Wang
Sihang Cai
...
Tao Jin
Jingyuan Chen
Jiacheng Sun
Jieming Zhu
Zhou Zhao
DiffM
128
3
0
22 Mar 2025
GOAL: Global-local Object Alignment Learning
Hyungyu Choi
Young Kyun Jang
Chanho Eom
VLM
410
0
0
22 Mar 2025
LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models
Jian Liang
Wenke Huang
Guancheng Wan
Qu Yang
Mang Ye
MoMe
CLL
AI4CE
122
5
0
21 Mar 2025
Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation
Giacomo Savazzi
Eugenio Lomurno
Cristian Sbrolli
Agnese Chiatti
Matteo Matteucci
81
0
0
21 Mar 2025
HSM: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation
Hou In Derek Pun
Hou In Ivan Tam
Austin T. Wang
Xiaoliang Huo
Angel X. Chang
Manolis Savva
3DV
103
1
0
21 Mar 2025
ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology
Vishwesh Ramanathan
Tony Xu
Pushpak Pati
Faruk Ahmed
Maged Goubran
Anne L. Martel
80
0
0
21 Mar 2025
Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval
Yuanmin Tang
Jing Yu
Keke Gai
Jiamin Zhuang
Gang Xiong
Gaopeng Gou
Qi Wu
VGen
178
2
0
21 Mar 2025
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
Kaisi Guan
Zhengfeng Lai
Yizhou Sun
Peng Zhang
Wei Liu
Kieran Liu
Meng Cao
Ruihua Song
VGen
93
0
0
21 Mar 2025