Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,352 papers shown
Title
Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering
Danfeng Guo
Sumitaka Honji
LRM
170
2
0
31 Jul 2024
Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration
Ngoc Son Nguyen
Van Nguyen
Tung Le
ViT
91
1
0
30 Jul 2024
UniProcessor: A Text-induced Unified Low-level Image Processor
Huiyu Duan
Xiongkuo Min
Sijing Wu
Wei Shen
Guangtao Zhai
DiffM
75
12
0
30 Jul 2024
SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition
Hao Tan
Zichang Tan
Jun Li
Jun Wan
Zhen Lei
Stan Z. Li
VLM
88
1
0
30 Jul 2024
Interpreting and Mitigating Hallucination in MLLMs through Multi-agent Debate
Zheng Lin
Zhenxing Niu
Zhibin Wang
Yinghui Xu
78
8
0
30 Jul 2024
CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models
Junda Wu
Xintong Li
Tong Yu
Yu Wang
Xiang Chen
Jiuxiang Gu
Lina Yao
Jingbo Shang
Julian McAuley
75
2
0
29 Jul 2024
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
80
7
0
29 Jul 2024
Specify and Edit: Overcoming Ambiguity in Text-Based Image Editing
Ekaterina Iakovleva
Fabio Pizzati
Philip Torr
Stéphane Lathuiliere
DiffM
96
0
0
29 Jul 2024
FlexAttention for Efficient High-Resolution Vision-Language Models
Junyan Li
Delin Chen
Tianle Cai
Peihao Chen
Yining Hong
Zhenfang Chen
Yikang Shen
Chuang Gan
VLM
125
5
0
29 Jul 2024
Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning
Xingchen Zeng
Haichuan Lin
Yilin Ye
Wei Zeng
98
17
0
29 Jul 2024
ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2
Wenjun Huang
Jiakai Pan
Jiahao Tang
Yanyu Ding
Yifei Xing
Yuhe Wang
Zhengzhuo Wang
Jianguo Hu
Mamba
107
8
0
29 Jul 2024
Urban Safety Perception Assessments via Integrating Multimodal Large Language Models with Street View Images
Jiaxin Zhanga
Yunqin Lia
Tomohiro Fukudab
Bowen Wang
76
1
0
29 Jul 2024
Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
Nitzan Bitton-Guetta
Aviv Slobodkin
Aviya Maimon
Eliya Habba
Royi Rassin
Yonatan Bitton
Idan Szpektor
Amir Globerson
Yuval Elovici
ReLM
VLM
LRM
83
6
0
28 Jul 2024
LLAVADI: What Matters For Multimodal Large Language Models Distillation
Shilin Xu
Xiangtai Li
Haobo Yuan
Lu Qi
Yunhai Tong
Ming-Hsuan Yang
73
4
0
28 Jul 2024
Multi-Modal CLIP-Informed Protein Editing
Mingze Yin
Hanjing Zhou
Yiheng Zhu
Miao Lin
YiXuan Wu
...
Hongxia Xu
Chang-Yu Hsieh
Tingjun Hou
Jintai Chen
Jian Wu
93
7
0
27 Jul 2024
Large Language Models for Human-like Autonomous Driving: A Survey
Yun Li
Kai Katsumata
Ehsan Javanmardi
Manabu Tsukada
LM&MA
88
11
0
27 Jul 2024
LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
Ruiyi Zhang
Yufan Zhou
Jian Chen
Jiuxiang Gu
Changyou Chen
Tongfei Sun
VLM
56
6
0
27 Jul 2024
VACoDe: Visual Augmented Contrastive Decoding
Sihyeon Kim
Boryeong Cho
Sangmin Bae
Sumyeong Ahn
SeYoung Yun
73
4
0
26 Jul 2024
SWIFT: Semantic Watermarking for Image Forgery Thwarting
Gautier Evennou
Vivien Chappelier
Ewa Kijak
Teddy Furon
80
2
0
26 Jul 2024
Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models
Xiang Shi
Jiawei Liu
Yinpeng Liu
Qikai Cheng
Wei Lu
78
0
0
26 Jul 2024
Towards Localized Fine-Grained Control for Facial Expression Generation
Tuomas Varanka
Huai-Qian Khor
Yante Li
Mengting Wei
Hanwei Kung
N. Sebe
Guoying Zhao
99
4
0
25 Jul 2024
Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li
Yikai Wang
Yanwei Fu
Dongyu Ru
Zheng Zhang
Tong He
VLM
59
4
0
25 Jul 2024
Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning
Vedanshu
M. M. Tripathi
Bhavnesh Jaint
MLLM
VLM
57
0
0
25 Jul 2024
Cost-effective Instruction Learning for Pathology Vision and Language Analysis
Kaitao Chen
Mianxin Liu
Fang Yan
Lei Ma
Xiaoming Shi
...
Xiaosong Wang
Lifeng Zhu
Zhe Wang
Mu Zhou
Shaoting Zhang
102
4
0
25 Jul 2024
Diffusion Models for Multi-Task Generative Modeling
Changyou Chen
Han Ding
Bunyamin Sisman
Yi Tian Xu
Ouye Xie
Benjamin Z. Yao
Son Dinh Tran
Belinda Zeng
DiffM
93
5
0
24 Jul 2024
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
Thomas Hummel
Shyamgopal Karthik
Mariana-Iuliana Georgescu
Zeynep Akata
EgoV
158
7
0
23 Jul 2024
Harmonizing Visual Text Comprehension and Generation
Zhen Zhao
Jingqun Tang
Binghong Wu
Chunhui Lin
Shubo Wei
Hao Liu
Xin Tan
Zhizhong Zhang
Can Huang
Yuan Xie
VLM
107
26
0
23 Jul 2024
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
Yiwei Ma
Zhibin Wang
Xiaoshuai Sun
Weihuang Lin
Qiang-feng Zhou
Jiayi Ji
Rongrong Ji
MLLM
VLM
110
2
0
23 Jul 2024
Improved Few-Shot Image Classification Through Multiple-Choice Questions
Dipika Khullar
Emmett Goodman
Negin Sokhandan
61
0
0
23 Jul 2024
SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation
Pengfei Chen
Lingxi Xie
Xinyue Huo
Xuehui Yu
Xiaopeng Zhang
Yingfei Sun
Zhenjun Han
Qi Tian
VLM
202
1
0
23 Jul 2024
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Mingze Xu
Mingfei Gao
Zhe Gan
Hong-You Chen
Zhengfeng Lai
Haiming Gang
Kai Kang
Afshin Dehghan
112
61
0
22 Jul 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
116
29
0
22 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
75
2
0
22 Jul 2024
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu
Dongxu Li
Bei Chen
Junnan Li
105
165
0
22 Jul 2024
HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning
Zhecan Wang
Garrett Bingham
Adams Wei Yu
Quoc V. Le
Thang Luong
Golnaz Ghiasi
MLLM
LRM
137
13
0
22 Jul 2024
Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models
Thinesh Thiyakesan Ponbagavathi
Kunyu Peng
Alina Roitberg
83
1
0
22 Jul 2024
In-Context Learning Improves Compositional Understanding of Vision-Language Models
Matteo Nulli
Anesa Ibrahimi
Avik Pal
Hoshe Lee
Ivona Najdenkoska
VLM
CoGe
77
0
0
22 Jul 2024
Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models
Kent Fujiwara
Mikihiro Tanaka
Qing Yu
89
2
0
22 Jul 2024
VideoGameBunny: Towards vision assistants for video games
Mohammad Reza Taesiri
Cor-Paul Bezemer
VLM
MLLM
81
2
0
21 Jul 2024
Text-Augmented Multimodal LLMs for Chemical Reaction Condition Recommendation
Yu Zhang
Ruijie Yu
Kaipeng Zeng
Ding Li
Feng Zhu
Xiaokang Yang
Yaohui Jin
Yanyan Xu
63
2
0
21 Jul 2024
Navigation Instruction Generation with BEV Perception and Large Language Models
Sheng Fan
Rui Liu
Wenguan Wang
Yi Yang
94
9
0
21 Jul 2024
MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
Pei Zhou
Yanchao Yang
79
1
0
21 Jul 2024
LSReGen: Large-Scale Regional Generator via Backward Guidance Framework
Bowen Zhang
Cheng Yang
Xuanhui Liu
DiffM
79
0
0
21 Jul 2024
OpenSU3D: Open World 3D Scene Understanding using Foundation Models
Rafay Mohiuddin
Sai Manoj Prakhya
Fiona Collins
Ziyuan Liu
André Borrmann
51
2
0
19 Jul 2024
Visual Text Generation in the Wild
Yuanzhi Zhu
Jiawei Liu
Feiyu Gao
Wenyu Liu
Xinggang Wang
Peng Wang
Fei Huang
Cong Yao
Zhibo Yang
DiffM
100
11
0
19 Jul 2024
Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance
Yongshuo Zhu
Lu Li
Keyan Chen
Chenyang Liu
Fugen Zhou
Z. Shi
78
4
0
19 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
113
7
0
18 Jul 2024
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
S. Swetha
Jinyu Yang
T. Neiman
Mamshad Nayeem Rizve
Son Tran
Benjamin Z. Yao
Trishul Chilimbi
Mubarak Shah
112
2
0
18 Jul 2024
SegPoint: Segment Any Point Cloud via Large Language Model
Shuting He
Henghui Ding
Xudong Jiang
Bihan Wen
3DV
MLLM
3DPC
90
19
0
18 Jul 2024
Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM
Baicheng Li
Zike Yan
Dong Wu
Hanqing Jiang
Hongbin Zha
CLL
57
1
0
18 Jul 2024
Previous
1
2
3
...
28
29
30
...
46
47
48
Next