Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
Vision-Integrated LLMs for Autonomous Driving Assistance : Human Performance Comparison and Trust Evaluation
Namhee Kim
Woojin Park
139
0
0
06 Feb 2025
DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation
Luciano Baresi
Davide Yi Xian Hu
Muhammad Irfan Masúdi
G. Quattrocchi
DiffM
VLM
179
1
0
05 Feb 2025
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models
Tzu-Tao Chang
Shivaram Venkataraman
VLM
561
0
0
04 Feb 2025
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Ahmed Masry
Juan A. Rodriguez
Tianyu Zhang
Suyuchen Wang
Chao Wang
...
I. Laradji
David Vazquez
Perouz Taslakian
Spandana Gella
Sai Rajeswar
96
0
0
03 Feb 2025
AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis
B. Alawode
I. I. Ganapathi
S. Javed
Naoufel Werghi
Mohammed Bennamoun
Arif Mahmood
CLIP
VLM
110
1
0
03 Feb 2025
Multimodal Inverse Attention Network with Intrinsic Discriminant Feature Exploitation for Fake News Detection
Tianlin Zhang
En Yu
Yi Shao
Shuai Li
181
0
0
03 Feb 2025
UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping
Aashish Rai
Dilin Wang
Mihir Jain
N. Sarafianos
Arthur Chen
Srinath Sridhar
Aayush Prakash
3DGS
184
1
0
03 Feb 2025
Visual Attention Never Fades: Selective Progressive Attention ReCalibration for Detailed Image Captioning in Multimodal Large Language Models
Mingi Jung
Saehuyng Lee
Eunji Kim
Sungroh Yoon
559
2
0
03 Feb 2025
BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts
Divya J. Bajpai
M. Hanawal
156
1
0
02 Feb 2025
VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
Yuxin Lin
Mengshi Qi
Liang Liu
Huadong Ma
CLL
80
2
0
02 Feb 2025
VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework
Chunbai Zhang
Chao Wang
Yang Zhou
Yan Peng
LRM
ReLM
154
0
0
02 Feb 2025
Leveraging Stable Diffusion for Monocular Depth Estimation via Image Semantic Encoding
Jingming Xia
Guanqun Cao
Guang Ma
Yiben Luo
Qinzhao Li
John Oyekan
MDE
139
0
0
01 Feb 2025
Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields
Xingyu Miao
Haoran Duan
Yang Bai
Tejal Shah
Jun Song
Yang Long
R. Ranjan
Ling Shao
161
5
0
31 Jan 2025
Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation
Bin Zhu
Hui yan Qi
Yinxuan Gui
Jingjing Chen
Chong-Wah Ngo
Ee-Peng Lim
439
2
0
31 Jan 2025
Continually Evolved Multimodal Foundation Models for Cancer Prognosis
Jie Peng
Shuang Zhou
Longwei Yang
Yiran Song
Mohan Zhang
Kaixiong Zhou
Feng Xie
Mingquan Lin
Rui Zhang
Tianlong Chen
213
0
0
30 Jan 2025
Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation
Lin Chen
Qi Yang
Kun Ding
Zhu Li
Gang Shen
Fei Li
Qiyuan Cao
Shiming Xiang
VLM
80
0
0
29 Jan 2025
Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement
Kei Katsumata
Motonari Kambara
Daichi Yashima
Ryosuke Korekata
Komei Sugiura
200
0
0
28 Jan 2025
Adversarially Robust Out-of-Distribution Detection Using Lyapunov-Stabilized Embeddings
Hossein Mirzaei
Mackenzie W. Mathis
OODD
AAML
129
6
0
28 Jan 2025
DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
Ke-Han Lu
Zhehuai Chen
Szu-Wei Fu
Chao-Han Huck Yang
Jagadeesh Balam
Boris Ginsburg
Yu-Te Wang
Hung-yi Lee
AuLLM
SyDa
170
16
0
28 Jan 2025
Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
Zhihang Lin
Mingbao Lin
Luxi Lin
Rongrong Ji
108
24
0
28 Jan 2025
Addressing Out-of-Label Hazard Detection in Dashcam Videos: Insights from the COOOL Challenge
Anh-Kiet Duong
Petra Gomez-Krämer
133
2
0
27 Jan 2025
PreciseCam: Precise Camera Control for Text-to-Image Generation
Edurne Bernal-Berdun
Ana Serrano
B. Masiá
Matheus Gadelha
Yannick Hold-Geoffroy
Xin Sun
Diego F. F. Gutierrez
DiffM
VGen
102
1
0
22 Jan 2025
Patent Figure Classification using Large Vision-language Models
Sushil Awale
Eric Müller-Budack
Ralph Ewerth
64
0
0
22 Jan 2025
Triplet Synthesis For Enhancing Composed Image Retrieval via Counterfactual Image Generation
Kenta Uesugi
Naoki Saito
Keisuke Maeda
Takahiro Ogawa
Miki Haseyama
75
0
0
22 Jan 2025
Parallel Sequence Modeling via Generalized Spatial Propagation Network
Hongjun Wang
Wonmin Byeon
Jiarui Xu
Liang Feng
Ka Chun Cheung
Xiaolong Wang
Kai Han
Jan Kautz
Sifei Liu
408
0
0
21 Jan 2025
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
MLLM
VLM
LRM
170
11
0
21 Jan 2025
Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection
Yuanze Li
Haolin Wang
Shihao Yuan
Ming-Yu Liu
Debin Zhao
Yiwen Guo
Chen Xu
Guangming Shi
Wangmeng Zuo
162
33
0
20 Jan 2025
Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding
Kohei Torimi
Ryosuke Yamada
Daichi Otsuka
Kensho Hara
Yuki M. Asano
Hirokatsu Kataoka
Y. Aoki
3DV
138
0
0
20 Jan 2025
Dynamic Scene Understanding from Vision-Language Representations
Shahaf Pruss
Morris Alper
Hadar Averbuch-Elor
OCL
507
0
0
20 Jan 2025
Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models
Abdulkadir Erol
Trilok Padhi
Agnik Saha
Ugur Kursuncu
Mehmet Emin Aktas
94
2
0
17 Jan 2025
DriveLM: Driving with Graph Visual Question Answering
Chonghao Sima
Katrin Renz
Kashyap Chitta
Lawrence Yunliang Chen
Hanxue Zhang
Chengen Xie
Jens Beißwenger
Ping Luo
Andreas Geiger
Hongyang Li
295
208
0
17 Jan 2025
A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan
Seowung Leem
Kyle B. See
Joshua K. Wong
Shaoting Zhang
R. Fang
AI4CE
LM&MA
VLM
286
27
0
17 Jan 2025
PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements
Xueyan Li
Xinyan Chen
Yazhe Niu
Shuai Hu
Yu Liu
OffRL
141
3
0
17 Jan 2025
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
Sitong Gong
Yunzhi Zhuge
Lu Zhang
Zhiyong Yang
Pingping Zhang
Huchuan Lu
95
3
0
15 Jan 2025
3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding
Haomiao Xiong
Yunzhi Zhuge
Jiawen Zhu
Lu Zhang
Huchuan Lu
79
3
0
14 Jan 2025
IP-FaceDiff: Identity-Preserving Facial Video Editing with Diffusion
Tharun Anand
Aryan Garg
Kaushik Mitra
VGen
DiffM
94
0
0
13 Jan 2025
RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment
Difei Gu
Yunhe Gao
Yang Zhou
Mu Zhou
Dimitris N. Metaxas
LM&MA
79
3
0
13 Jan 2025
Initial Findings on Sensor based Open Vocabulary Activity Recognition via Text Embedding Inversion
L. Ray
Bo Zhou
Sungho Suh
P. Lukowicz
VLM
91
0
0
13 Jan 2025
Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models
Y. Ranasinghe
Vibashan Vs
James Uplinger
C. D. Melo
Vishal M. Patel
66
0
0
13 Jan 2025
TimeLogic: A Temporal Logic Benchmark for Video QA
S. Swetha
Hilde Kuehne
Mubarak Shah
64
1
0
13 Jan 2025
LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models
Mozhgan Nasr Azadani
James Riddell
Sean Sedwards
Krzysztof Czarnecki
MLLM
VLM
87
3
0
13 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
Dahua Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
254
134
0
10 Jan 2025
VideoAuteur: Towards Long Narrative Video Generation
Junfei Xiao
Feng Cheng
Lu Qi
Liangke Gui
Jiepeng Cen
Zhibei Ma
Alan Yuille
Lu Jiang
VGen
124
2
0
10 Jan 2025
Instructive3D: Editing Large Reconstruction Models with Text Instructions
Kunal Kathare
Ankit Dhiman
K Vikas Gowda
Siddharth Aravindan
Shubham Monga
Basavaraja Shanthappa Vandrotti
Lokesh R. Boregowda
DiffM
76
2
0
08 Jan 2025
SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild
Jiawei Liu
Yuanzhi Zhu
Feiyu Gao
Zhiyong Yang
P. Wang
Junyang Lin
Xinyu Wang
Wenyu Liu
DiffM
94
0
0
08 Jan 2025
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Mingjie Pan
Jiyao Zhang
Tianshu Wu
Yinghao Zhao
Wenlong Gao
Hao Dong
LM&Ro
117
13
0
08 Jan 2025
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Yuzhou Huang
Ziyang Yuan
Quande Liu
Qiulin Wang
Xintao Wang
Ruimao Zhang
Pengfei Wan
Di Zhang
Kun Gai
VGen
DiffM
154
16
0
08 Jan 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan
Xianrui Li
Tao Zhang
Zilong Huang
Shilin Xu
S. Ji
Yunhai Tong
Lu Qi
Jiashi Feng
Ming-Hsuan Yang
VLM
195
25
0
07 Jan 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang
Qingkai Fang
Zhe Yang
Yang Feng
MLLM
VLM
168
43
0
07 Jan 2025
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Wenyi Hong
Yean Cheng
Zhiyong Yang
Weihan Wang
Lefan Wang
Xiaotao Gu
Shiyu Huang
Yuxiao Dong
J. Tang
CoGe
VLM
156
5
0
06 Jan 2025
Previous
1
2
3
...
15
16
17
...
45
46
47
Next