Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
Junlin Han
Filippos Kokkinos
Philip Torr
VGen
141
42
0
18 Mar 2024
VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model
Qi Zuo
Xiaodong Gu
Lingteng Qiu
Yuan Dong
Zhengyi Zhao
...
Rui Peng
Siyu Zhu
Zilong Dong
Liefeng Bo
Qixing Huang
DiffM
VGen
95
26
0
18 Mar 2024
Agent3D-Zero: An Agent for Zero-shot 3D Understanding
Sha Zhang
Di Huang
Jiajun Deng
Shixiang Tang
Wanli Ouyang
Tong He
Yanyong Zhang
VGen
66
18
0
18 Mar 2024
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
M. Jehanzeb Mirza
Leonid Karlinsky
Wei Lin
Sivan Doveh
Jakub Micorek
Mateusz Koziñski
Hilde Kuhene
Horst Possegger
VLM
MLLM
99
14
0
18 Mar 2024
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu
Yuan Yao
Zonghao Guo
Junbo Cui
Zanlin Ni
Chunjiang Ge
Tat-Seng Chua
Zhiyuan Liu
Maosong Sun
Gao Huang
VLM
MLLM
128
121
0
18 Mar 2024
Prioritized Semantic Learning for Zero-shot Instance Navigation
Xander Sun
Louis Lau
Hoyard Zhi
Ronghe Qiu
Junwei Liang
82
11
0
18 Mar 2024
OCR is All you need: Importing Multi-Modality into Image-based Defect Detection System
Chih-Chung Hsu
Chia-Ming Lee
Chun-Hung Sun
Kuang-Ming Wu
131
0
0
18 Mar 2024
Continual Forgetting for Pre-trained Vision Models
Hongbo Zhao
Bolin Ni
Haochen Wang
Junsong Fan
Fei Zhu
Yuxi Wang
Yuntao Chen
Gaofeng Meng
Zhaoxiang Zhang
MU
VLM
136
13
0
18 Mar 2024
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
Ruicheng Wang
Jianfeng Xiang
Jiaolong Yang
Xin Tong
DiffM
86
5
0
18 Mar 2024
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan
Xiaojian Ma
Rujie Wu
Yuntao Du
Jiaqi Li
Zhi Gao
Qing Li
VLM
LLMAG
125
70
0
18 Mar 2024
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
Rao Fu
Jingyu Liu
Xilun Chen
Yixin Nie
Wenhan Xiong
LM&Ro
LRM
115
74
0
18 Mar 2024
ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
Siyuan Huang
Iaroslav Ponomarenko
Zhengkai Jiang
Xiaoqi Li
Xiaobin Hu
Peng Gao
Hongsheng Li
Hao Dong
LM&Ro
120
21
0
17 Mar 2024
MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
Paul S. Scotti
Mihir Tripathy
Cesar Kadir Torrico Villanueva
Reese Kneeland
Tong Chen
...
Charan Santhirasegaran
Jonathan Xu
Thomas Naselaris
Kenneth A. Norman
Tanishq Mathew Abraham
96
45
0
17 Mar 2024
Correcting misinformation on social media with a large language model
Xinyi Zhou
Ashish Sharma
Amy X. Zhang
Tim Althoff
KELM
87
5
0
17 Mar 2024
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
Zhe Kong
Yong Zhang
Tianyu Yang
Tao Wang
Kaihao Zhang
Bizhu Wu
Guanying Chen
Wei Liu
Wenhan Luo
DiffM
105
31
0
16 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
62
2
0
16 Mar 2024
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Tianhe Wu
Kede Ma
Jie Liang
Yujiu Yang
Lei Zhang
73
26
0
16 Mar 2024
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
Yizhi Song
Zhifei Zhang
Zhe Lin
Scott D. Cohen
Brian L. Price
Jianming Zhang
Soo Ye Kim
He Zhang
Wei Xiong
Daniel G. Aliaga
DiffM
102
41
0
15 Mar 2024
LightIt: Illumination Modeling and Control for Diffusion Models
Peter Kocsis
Julien Philip
Kalyan Sunkavalli
Matthias Nießner
Yannick Hold-Geoffroy
74
24
0
15 Mar 2024
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang
Yuhui Zhang
Orr Zohar
Serena Yeung-Levy
VLM
206
107
0
15 Mar 2024
CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning
Yukun Li
Guansong Pang
Wei Suo
Chenchen Jing
Yuling Xi
Lingqiao Liu
Hao Chen
Guoqiang Liang
Peng Wang
CLL
VLM
81
8
0
15 Mar 2024
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Yueqian Wang
Xiaojun Meng
Jianxin Liang
Yuxuan Wang
Qun Liu
Dongyan Zhao
73
34
0
15 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Chuang Lin
Yi Jiang
Zhuang Li
Zehuan Yuan
Jianfei Cai
ObjD
VLM
82
20
0
15 Mar 2024
Autonomous Monitoring of Pharmaceutical R&D Laboratories with 6 Axis Arm Equipped Quadruped Robot and Generative AI: A Preliminary Study
Shunichi Hato
Nozomi Ogawa
64
1
0
15 Mar 2024
Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao
Jian Jia
Longteng Guo
Qunbo Wang
Te Yang
...
Yanhua Cheng
Bo Wang
Quan Chen
Han Li
Jing Liu
74
1
0
15 Mar 2024
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
Enguang Wang
Zhimao Peng
Zhengyuan Xie
Fei Yang
Xialei Liu
Ming-Ming Cheng
135
3
0
15 Mar 2024
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Haiwen Huang
Songyou Peng
Dan Zhang
Andreas Geiger
VLM
76
3
0
14 Mar 2024
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation
Minbin Huang
Yanxin Long
Xinchi Deng
Ruihang Chu
Jiangfeng Xiong
Xiaodan Liang
Hong Cheng
Qinglin Lu
Wei Liu
MLLM
EGVM
175
10
0
13 Mar 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
Vu Minh Hieu Phan
Yutong Xie
Yuankai Qi
Lingqiao Liu
Liyang Liu
Bowen Zhang
Zhibin Liao
Qi Wu
Minh-Son To
Johan Verjans
128
14
0
12 Mar 2024
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning
Bingqian Lin
Yunshuang Nie
Ziming Wei
Jiaqi Chen
Shikui Ma
Jianhua Han
Hang Xu
Xiaojun Chang
Xiaodan Liang
LM&Ro
LRM
141
28
0
12 Mar 2024
QUASAR: QUality and Aesthetics Scoring with Advanced Representations
Sergey Kastryulin
Denis Prokopenko
Artem Babenko
Dmitry V. Dylov
61
0
0
11 Mar 2024
VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model
Junsu Kim
Yunhoe Ku
Jihyeon Kim
Junuk Cha
Seungryul Baek
ObjD
VLM
97
14
0
08 Mar 2024
Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models
Qiuhui Chen
Huping Ye
Yi Hong
MedIm
83
1
0
08 Mar 2024
Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis
Mu-Hwa Chen
Yi Liu
Jian Yi
Changran Xu
Qiuxia Lai
Hongliang Wang
Tsung-Yi Ho
Qiang Xu
EGVM
82
10
0
08 Mar 2024
Large Language Models are In-Context Molecule Learners
Jiatong Li
Wei Liu
Zhihao Ding
Wenqi Fan
Yuqiang Li
Qing Li
120
6
0
07 Mar 2024
MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation
Zhongwei Wan
Che Liu
Xin Wang
Chaofan Tao
Hui Shen
Zhenwu Peng
Jie Fu
Rossella Arcucci
Huaxiu Yao
106
10
0
07 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
86
15
0
06 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM
ObjD
115
36
0
04 Mar 2024
Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation
Maksim Kuprashevich
Grigorii Alekseenko
Irina Tolstykh
ELM
153
6
0
04 Mar 2024
Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency
Akila Wickramasekara
Frank Breitinger
Mark Scanlon
150
10
0
29 Feb 2024
Grounding Language Models for Visual Entity Recognition
Zilin Xiao
Ming Gong
Paola Cascante-Bonilla
Xingyao Zhang
Jie Wu
Vicente Ordonez
VLM
97
10
0
28 Feb 2024
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
Xiujie Song
Mengyue Wu
Ke Zhu
Chunhao Zhang
Yanyi Chen
LRM
ELM
134
3
0
28 Feb 2024
VCD: A Dataset for Visual Commonsense Discovery in Images
Xiangqing Shen
Yurun Song
Siwei Wu
Rui Xia
113
6
0
27 Feb 2024
Diffusion Model-Based Image Editing: A Survey
Yi Huang
Jiancheng Huang
Yifan Liu
Mingfu Yan
Jiaxi Lv
Jianzhuang Liu
Wei Xiong
He Zhang
Liangliang Cao
Liangliang Cao
EGVM
263
103
0
27 Feb 2024
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Minsu Kim
Jee-weon Jung
Hyeongseop Rha
Soumi Maiti
Siddhant Arora
Xuankai Chang
Shinji Watanabe
Y. Ro
102
7
0
25 Feb 2024
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
Yiyang Zhou
Chenhang Cui
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
VLM
MLLM
120
121
0
18 Feb 2024
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Junfei Xiao
Zheng Xu
Alan Yuille
Shen Yan
Boyu Wang
48
3
0
16 Feb 2024
OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models
Yuxuan Kuang
Hai Lin
Meng Jiang
LM&Ro
103
33
0
16 Feb 2024
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu
Tian-Xin Li
Quanfeng Lu
Wenqi Shao
Junjun He
Yu Qiao
Ping Luo
ELM
LM&MA
87
67
0
14 Feb 2024
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
Michael Dorkenwald
Nimrod Barazani
Cees G. M. Snoek
Yuki M. Asano
VLM
MLLM
59
12
0
13 Feb 2024
Previous
1
2
3
...
42
43
44
45
46
47
Next