2301.12597
Cited By
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,338 papers shown
Title
D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens
Panpan Wang
Liqiang Niu
Fandong Meng
Jinan Xu
Yufeng Chen
Jie Zhou
DiffM
108
0
0
21 Mar 2025
ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology
Vishwesh Ramanathan
Tony Xu
Pushpak Pati
Faruk Ahmed
Maged Goubran
Anne L. Martel
80
0
0
21 Mar 2025
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
Jianing Qi
Jiawei Liu
Hao Tang
Zhigang Zhu
165
4
0
21 Mar 2025
HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis
Mengtian Li
Jinshu Chen
Wanquan Feng
Bingchuan Li
Fei Dai
Mingcong Liu
Qian He
3DH
88
0
0
21 Mar 2025
HSM: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation
Hou In Derek Pun
Hou In Ivan Tam
Austin T. Wang
Xiaoliang Huo
Angel X. Chang
Manolis Savva
3DV
103
1
0
21 Mar 2025
Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance
Hui Liu
Wenya Wang
Kecheng Chen
Jie Liu
Yibing Liu
Tiexin Qin
Peisong He
Xinghao Jiang
Haoliang Li
BDL
VLM
475
0
0
20 Mar 2025
EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation
Zihao Zhang
Haoran Chen
Haoyu Zhao
Guansong Lu
Yanwei Fu
Hang Xu
Zuxuan Wu
VGen
DiffM
173
2
0
20 Mar 2025
REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models
Jie M. Zhang
Zheng Yuan
Ziyi Wang
Bei Yan
Sibo Wang
Xiangkui Cao
Zonghui Guo
Shiguang Shan
Xilin Chen
ELM
137
0
0
20 Mar 2025
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Liming Jiang
Qing Yan
Yumin Jia
Zichuan Liu
Hao Kang
Xin Lu
110
4
0
20 Mar 2025
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Keda Tao
Haoxuan You
Yang Sui
Can Qin
Haoyu Wang
VLM
MQ
139
2
0
20 Mar 2025
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
Zenghui Yuan
Jiawen Shi
Pan Zhou
Neil Zhenqiang Gong
Lichao Sun
AAML
163
3
0
20 Mar 2025
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
Eduard Allakhverdov
Elizaveta Goncharova
Andrey Kuznetsov
74
0
0
20 Mar 2025
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
Han Wang
Kai Hu
Liangcai Gao
341
0
0
20 Mar 2025
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae
Jinhyung Kim
Sihaeng Lee
Soonyoung Lee
G. Lee
Jinwoo Choi
114
2
0
20 Mar 2025
Single Image Iterative Subject-driven Generation and Editing
Yair Shpitzer
Gal Chechik
Idan Schwartz
89
0
0
20 Mar 2025
Unleashing Vecset Diffusion Model for Fast Shape Generation
Zeqiang Lai
Yunfei Zhao
Zibo Zhao
Haolin Liu
Fuyun Wang
...
Jinwei Huang
Yuhong Liu
Jie Jiang
Chunchao Guo
Xiangyu Yue
DiffM
505
2
0
20 Mar 2025
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions
Xiaomeng Chu
Jiajun Deng
Guoliang You
Wei Liu
Xuzhao Li
Jianmin Ji
Yanzhe Zhang
132
0
0
20 Mar 2025
A Vision Centric Remote Sensing Benchmark
Abduljaleel Adejumo
Faegheh Yeganli
Clifford Broni-bediako
Aoran Xiao
Naoto Yokoya
Mennatullah Siam
146
0
0
20 Mar 2025
Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation
Clive Tinashe Marimo
Benedikt Blumenstiel
Maximilian Nitsche
Johannes Jakubik
Thomas Brunschwiler
87
1
0
20 Mar 2025
Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection
Peipeng Yu
Jianwei Fei
Hui Gao
Xuan Feng
Zhihua Xia
Chip-Hong Chang
MLLM
VLM
107
1
0
19 Mar 2025
Continual Multimodal Contrastive Learning
Xiaohao Liu
Xiaobo Xia
See-Kiong Ng
Tat-Seng Chua
CLL
230
2
0
19 Mar 2025
LEGION: Learning to Ground and Explain for Synthetic Image Detection
Hengrui Kang
Siwei Wen
Zichen Wen
Junyan Ye
Weijia Li
...
Baichuan Zhou
Bin Wang
Dahua Lin
Linfeng Zhang
Conghui He
97
6
0
19 Mar 2025
Sparseformer: a Transferable Transformer with Multi-granularity Token Sparsification for Medical Time Series Classification
Weiqi Zhang
Jiexia Ye
Zehan Li
Jiajun Li
Fugee Tsung
MedIm
73
0
0
19 Mar 2025
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
Tengjin Weng
Jingyi Wang
Wenhao Jiang
Zhong Ming
VLM
LRM
84
0
0
19 Mar 2025
CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification
Wenlong Yu
Qilong Wang
Chuang Liu
Dong Li
Q. Hu
LRM
97
0
0
19 Mar 2025
EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?
Xinyan Chen
Jiaxin Ge
Hongming Dai
Qiang Zhou
Qiuxuan Feng
Jingtong Hu
Yun Wang
Jiaming Liu
Shanghang Zhang
LM&Ro
97
0
0
19 Mar 2025
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
Shuo Li
Jiajun Sun
Guodong Zheng
Xiaoran Fan
Yujiong Shen
...
Wenming Tan
Tao Ji
Tao Gui
Qi Zhang
Xuanjing Huang
AAML
VLM
192
1
0
19 Mar 2025
Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes
Sarosij Bose
Arindam Dutta
Sayak Nag
Junge Zhang
Jiachen Li
Konstantinos Karydis
Amit K. Roy-Chowdhury
116
0
0
19 Mar 2025
TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models
Teng-Fang Hsiao
Bo-Kai Ruan
Yi-Lun Wu
Tzu-Ling Lin
Hong-Han Shuai
VLM
136
1
0
19 Mar 2025
A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition
Ritabrata Chakraborty
Shivakumara Palaiahnakote
Umapada Pal
Cheng-Lin Liu
VLM
110
0
0
19 Mar 2025
FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks
Siqi Zhang
Yanyuan Qiao
Qunbo Wang
Longteng Guo
Zhihua Wei
Qingbin Liu
LM&Ro
155
3
0
18 Mar 2025
Where do Large Vision-Language Models Look at when Answering Questions?
X. Xing
Chia-Wen Kuo
Li Fuxin
Yulei Niu
Fan Chen
Ming Li
Ying Wu
Longyin Wen
Sijie Zhu
LRM
119
1
0
18 Mar 2025
LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation
Yang Zhou
Shiyu Zhao
Yuxiao Chen
Zhenting Wang
Can Jin
Dimitris N. Metaxas
ObjD
145
0
0
18 Mar 2025
ExDDV: A New Dataset for Explainable Deepfake Detection in Video
Vlad Hondru
Eduard Hogea
Darian M. Onchis
Radu Tudor Ionescu
147
2
0
18 Mar 2025
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
Donggon Jang
Yucheol Cho
Suin Lee
Taehyeon Kim
Dae-Shik Kim
VLM
91
3
0
18 Mar 2025
Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation
Umar Farooq
Jean-Yves Guillemaut
Adrian Hilton
M. Volino
3DGS
115
0
0
18 Mar 2025
EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models
Zongyun Zhang
Jiacheng Ruan
Xian Gao
Ting Liu
Yuzhuo Fu
126
2
0
18 Mar 2025
Can Large Vision Language Models Read Maps Like a Human?
Shuo Xing
Zezhou Sun
Shuangyu Xie
Kaiyuan Chen
Yanjia Huang
Yuping Wang
Jiachen Li
Dezhen Song
Zhengzhong Tu
142
8
0
18 Mar 2025
Squeeze Out Tokens from Sample for Finer-Grained Data Governance
Weixiong Lin
Chen Ju
Haicheng Wang
Shengchao Hu
Shuai Xiao
...
Yuheng Jiao
Mingshuai Yao
Jinsong Lan
Qingwen Liu
Ying Chen
84
0
0
18 Mar 2025
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Ziwei Wang
Weizhi Chen
Leyang Yang
Sheng Zhou
Shengchu Zhao
Hanbei Zhan
Jiongchao Jin
Liangcheng Li
Zirui Shao
Jiajun Bu
128
5
0
18 Mar 2025
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Yuxiang Lai
Shitian Zhao
Ming Li
Jike Zhong
Xiaofeng Yang
OffRL
LRM
LM&MA
VLM
186
31
0
18 Mar 2025
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models
Xinyu Tian
Shu Zou
Zhaoyuan Yang
Jing Zhang
108
3
0
18 Mar 2025
RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving
Yujin Wang
Quanfeng Liu
Zhengxin Jiang
Tianyi Wang
Junfeng Jiao
Hongqing Chu
B. Gao
Hong Chen
124
5
0
18 Mar 2025
ChatBEV: A Visual Language Model that Understands BEV Maps
Qingyao Xu
Tian Jin
Guang Chen
Yanfeng Wang
Yize Zhang
70
1
0
18 Mar 2025
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
Sara Sarto
Marcella Cornia
Rita Cucchiara
86
1
0
18 Mar 2025
Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation
Sayak Nag
Udita Ghosh
Sarosij Bose
Calvin-Khang Ta
Jiachen Li
Amit K. Roy-Chowdhury
221
0
0
18 Mar 2025
Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic
Monika Shah
Somdeb Sarkhel
Deepak Venugopal
MLLM
BDL
VLM
127
0
0
18 Mar 2025
Evolution-based Region Adversarial Prompt Learning for Robustness Enhancement in Vision-Language Models
Xiaojun Jia
Sensen Gao
Simeng Qin
Ke Ma
Xianrui Li
Yihao Huang
Wei Dong
Yang Liu
Xiaochun Cao
AAML
VLM
120
2
0
17 Mar 2025
One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation
Daniil Selikhanovych
David Li
Aleksei Leonov
Nikita Gushchin
Sergei Kushneriuk
Alexander N. Filippov
Evgeny Burnaev
Iaroslav Koshelev
Alexander Korotin
DiffM
157
0
0
17 Mar 2025
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Mingyang Song
Xiaoye Qu
Jiawei Zhou
Yu Cheng
VLM
170
1
0
17 Mar 2025