ResearchTrend.AI

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,345 papers shown
VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models
Junlin Han
Filippos Kokkinos
Philip Torr
VGen
141
42
0
18 Mar 2024
VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model
Qi Zuo
Xiaodong Gu
Lingteng Qiu
Yuan Dong
Zhengyi Zhao
...
Rui Peng
Siyu Zhu
Zilong Dong
Liefeng Bo
Qixing Huang
DiffM, VGen
95
26
0
18 Mar 2024
Agent3D-Zero: An Agent for Zero-shot 3D Understanding
Sha Zhang
Di Huang
Jiajun Deng
Shixiang Tang
Wanli Ouyang
Tong He
Yanyong Zhang
VGen
66
18
0
18 Mar 2024
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
M. Jehanzeb Mirza
Leonid Karlinsky
Wei Lin
Sivan Doveh
Jakub Micorek
Mateusz Koziński
Hilde Kuehne
Horst Possegger
VLM, MLLM
99
14
0
18 Mar 2024
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu
Yuan Yao
Zonghao Guo
Junbo Cui
Zanlin Ni
Chunjiang Ge
Tat-Seng Chua
Zhiyuan Liu
Maosong Sun
Gao Huang
VLM, MLLM
128
121
0
18 Mar 2024
Prioritized Semantic Learning for Zero-shot Instance Navigation
Xander Sun
Louis Lau
Hoyard Zhi
Ronghe Qiu
Junwei Liang
82
11
0
18 Mar 2024
OCR is All you need: Importing Multi-Modality into Image-based Defect Detection System
Chih-Chung Hsu
Chia-Ming Lee
Chun-Hung Sun
Kuang-Ming Wu
131
0
0
18 Mar 2024
Continual Forgetting for Pre-trained Vision Models
Hongbo Zhao
Bolin Ni
Haochen Wang
Junsong Fan
Fei Zhu
Yuxi Wang
Yuntao Chen
Gaofeng Meng
Zhaoxiang Zhang
MU, VLM
136
13
0
18 Mar 2024
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
Ruicheng Wang
Jianfeng Xiang
Jiaolong Yang
Xin Tong
DiffM
86
5
0
18 Mar 2024
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan
Xiaojian Ma
Rujie Wu
Yuntao Du
Jiaqi Li
Zhi Gao
Qing Li
VLM, LLMAG
125
70
0
18 Mar 2024
Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning
Rao Fu
Jingyu Liu
Xilun Chen
Yixin Nie
Wenhan Xiong
LM&Ro, LRM
115
74
0
18 Mar 2024
ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
Siyuan Huang
Iaroslav Ponomarenko
Zhengkai Jiang
Xiaoqi Li
Xiaobin Hu
Peng Gao
Hongsheng Li
Hao Dong
LM&Ro
120
21
0
17 Mar 2024
MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
Paul S. Scotti
Mihir Tripathy
Cesar Kadir Torrico Villanueva
Reese Kneeland
Tong Chen
...
Charan Santhirasegaran
Jonathan Xu
Thomas Naselaris
Kenneth A. Norman
Tanishq Mathew Abraham
96
45
0
17 Mar 2024
Correcting misinformation on social media with a large language model
Xinyi Zhou
Ashish Sharma
Amy X. Zhang
Tim Althoff
KELM
87
5
0
17 Mar 2024
OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models
Zhe Kong
Yong Zhang
Tianyu Yang
Tao Wang
Kaihao Zhang
Bizhu Wu
Guanying Chen
Wei Liu
Wenhan Luo
DiffM
105
31
0
16 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
62
2
0
16 Mar 2024
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Tianhe Wu
Kede Ma
Jie Liang
Yujiu Yang
Lei Zhang
73
26
0
16 Mar 2024
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
Yizhi Song
Zhifei Zhang
Zhe Lin
Scott D. Cohen
Brian L. Price
Jianming Zhang
Soo Ye Kim
He Zhang
Wei Xiong
Daniel G. Aliaga
DiffM
102
41
0
15 Mar 2024
LightIt: Illumination Modeling and Control for Diffusion Models
Peter Kocsis
Julien Philip
Kalyan Sunkavalli
Matthias Nießner
Yannick Hold-Geoffroy
74
24
0
15 Mar 2024
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang
Yuhui Zhang
Orr Zohar
Serena Yeung-Levy
VLM
206
107
0
15 Mar 2024
CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning
Yukun Li
Guansong Pang
Wei Suo
Chenchen Jing
Yuling Xi
Lingqiao Liu
Hao Chen
Guoqiang Liang
Peng Wang
CLL, VLM
81
8
0
15 Mar 2024
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Yueqian Wang
Xiaojun Meng
Jianxin Liang
Yuxuan Wang
Qun Liu
Dongyan Zhao
73
34
0
15 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Chuang Lin
Yi Jiang
Zhuang Li
Zehuan Yuan
Jianfei Cai
ObjD, VLM
82
20
0
15 Mar 2024
Autonomous Monitoring of Pharmaceutical R&D Laboratories with 6 Axis Arm Equipped Quadruped Robot and Generative AI: A Preliminary Study
Shunichi Hato
Nozomi Ogawa
64
1
0
15 Mar 2024
Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao
Jian Jia
Longteng Guo
Qunbo Wang
Te Yang
...
Yanhua Cheng
Bo Wang
Quan Chen
Han Li
Jing Liu
74
1
0
15 Mar 2024
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
Enguang Wang
Zhimao Peng
Zhengyuan Xie
Fei Yang
Xialei Liu
Ming-Ming Cheng
135
3
0
15 Mar 2024
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Haiwen Huang
Songyou Peng
Dan Zhang
Andreas Geiger
VLM
76
3
0
14 Mar 2024
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation
Minbin Huang
Yanxin Long
Xinchi Deng
Ruihang Chu
Jiangfeng Xiong
Xiaodan Liang
Hong Cheng
Qinglin Lu
Wei Liu
MLLM, EGVM
175
10
0
13 Mar 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
Vu Minh Hieu Phan
Yutong Xie
Yuankai Qi
Lingqiao Liu
Liyang Liu
Bowen Zhang
Zhibin Liao
Qi Wu
Minh-Son To
Johan Verjans
128
14
0
12 Mar 2024
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning
Bingqian Lin
Yunshuang Nie
Ziming Wei
Jiaqi Chen
Shikui Ma
Jianhua Han
Hang Xu
Xiaojun Chang
Xiaodan Liang
LM&Ro, LRM
141
28
0
12 Mar 2024
QUASAR: QUality and Aesthetics Scoring with Advanced Representations
Sergey Kastryulin
Denis Prokopenko
Artem Babenko
Dmitry V. Dylov
61
0
0
11 Mar 2024
VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model
Junsu Kim
Yunhoe Ku
Jihyeon Kim
Junuk Cha
Seungryul Baek
ObjD, VLM
97
14
0
08 Mar 2024
Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models
Qiuhui Chen
Huping Ye
Yi Hong
MedIm
83
1
0
08 Mar 2024
Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis
Mu-Hwa Chen
Yi Liu
Jian Yi
Changran Xu
Qiuxia Lai
Hongliang Wang
Tsung-Yi Ho
Qiang Xu
EGVM
82
10
0
08 Mar 2024
Large Language Models are In-Context Molecule Learners
Jiatong Li
Wei Liu
Zhihao Ding
Wenqi Fan
Yuqiang Li
Qing Li
120
6
0
07 Mar 2024
MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation
Zhongwei Wan
Che Liu
Xin Wang
Chaofan Tao
Hui Shen
Zhenwu Peng
Jie Fu
Rossella Arcucci
Huaxiu Yao
106
10
0
07 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
86
15
0
06 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM, ObjD
115
36
0
04 Mar 2024
Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation
Maksim Kuprashevich
Grigorii Alekseenko
Irina Tolstykh
ELM
153
6
0
04 Mar 2024
Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency
Akila Wickramasekara
Frank Breitinger
Mark Scanlon
150
10
0
29 Feb 2024
Grounding Language Models for Visual Entity Recognition
Zilin Xiao
Ming Gong
Paola Cascante-Bonilla
Xingyao Zhang
Jie Wu
Vicente Ordonez
VLM
97
10
0
28 Feb 2024
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
Xiujie Song
Mengyue Wu
Ke Zhu
Chunhao Zhang
Yanyi Chen
LRM, ELM
134
3
0
28 Feb 2024
VCD: A Dataset for Visual Commonsense Discovery in Images
Xiangqing Shen
Yurun Song
Siwei Wu
Rui Xia
113
6
0
27 Feb 2024
Diffusion Model-Based Image Editing: A Survey
Yi Huang
Jiancheng Huang
Yifan Liu
Mingfu Yan
Jiaxi Lv
Jianzhuang Liu
Wei Xiong
He Zhang
Liangliang Cao
EGVM
263
103
0
27 Feb 2024
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Minsu Kim
Jee-weon Jung
Hyeongseop Rha
Soumi Maiti
Siddhant Arora
Xuankai Chang
Shinji Watanabe
Y. Ro
102
7
0
25 Feb 2024
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
Yiyang Zhou
Chenhang Cui
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
VLM, MLLM
120
121
0
18 Feb 2024
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Junfei Xiao
Zheng Xu
Alan Yuille
Shen Yan
Boyu Wang
48
3
0
16 Feb 2024
OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models
Yuxuan Kuang
Hai Lin
Meng Jiang
LM&Ro
103
33
0
16 Feb 2024
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu
Tian-Xin Li
Quanfeng Lu
Wenqi Shao
Junjun He
Yu Qiao
Ping Luo
ELM, LM&MA
87
67
0
14 Feb 2024
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
Michael Dorkenwald
Nimrod Barazani
Cees G. M. Snoek
Yuki M. Asano
VLM, MLLM
59
12
0
13 Feb 2024