ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.12597
  4. Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
v1v2v3 (latest)

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLMMLLM
ArXiv (abs)PDFHTML

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,345 papers shown
Title
LLMs and Stack Overflow Discussions: Reliability, Impact, and Challenges
LLMs and Stack Overflow Discussions: Reliability, Impact, and Challenges
Leuson Da Silva
Jordan Samhi
Foutse Khomh
SILMALMAI4MHELM
97
10
0
13 Feb 2024
Domain Adaptable Fine-Tune Distillation Framework For Advancing Farm
  Surveillance
Domain Adaptable Fine-Tune Distillation Framework For Advancing Farm Surveillance
Raza Imam
Muhammad Huzaifa
Nabil Mansour
Shaher Bano Mirza
Fouad Lamghari
121
1
0
10 Feb 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
175
7
0
08 Feb 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Chris Liu
Renrui Zhang
Longtian Qiu
Siyuan Huang
Weifeng Lin
...
Hao Shao
Pan Lu
Hongsheng Li
Yu Qiao
Peng Gao
MLLM
237
116
0
08 Feb 2024
CIC: A Framework for Culturally-Aware Image Captioning
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
130
6
0
08 Feb 2024
Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue
Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue
Kun Ouyang
Liqiang Jing
Xuemeng Song
Meng Liu
Yupeng Hu
Liqiang Nie
192
3
0
06 Feb 2024
Multimodal Rationales for Explainable Visual Question Answering
Multimodal Rationales for Explainable Visual Question Answering
Kun Li
G. Vosselman
Michael Ying Yang
132
2
0
06 Feb 2024
When Large Language Models Meet Vector Databases: A Survey
When Large Language Models Meet Vector Databases: A Survey
Zhi Jing
Yongye Su
Yikun Han
Bo Yuan
Haiyun Xu
Chunjiang Liu
Kehai Chen
Min Zhang
138
38
0
30 Jan 2024
LanDA: Language-Guided Multi-Source Domain Adaptation
LanDA: Language-Guided Multi-Source Domain Adaptation
Zhenbin Wang
Lei Zhang
Lituan Wang
Minjuan Zhu
87
10
0
25 Jan 2024
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question
  Understanding and Reasoning
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
Zheqi He
Xinya Wu
Pengfei Zhou
Richeng Xuan
Guang Liu
Xi Yang
Qiannan Zhu
Hua Huang
ELMLRM
108
20
0
25 Jan 2024
Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
Ci-Siang Lin
Chien-Yi Wang
Yu-Chiang Frank Wang
Min-Hung Chen
VLM
248
0
0
22 Jan 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model
  for Multimodal Processing
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
78
0
0
22 Jan 2024
Q&A Prompts: Discovering Rich Visual Clues through Mining
  Question-Answer Prompts for VQA requiring Diverse World Knowledge
Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge
Haibi Wang
Weifeng Ge
LRM
108
4
0
19 Jan 2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang
Weixin Luo
Qianyu Chen
Haonan Mai
Jindi Guo
Sixun Dong
Xiaohua Xuan
MLLMLLMAG
145
20
0
19 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&RoLLMAG
181
41
0
16 Jan 2024
Discriminative Consensus Mining with A Thousand Groups for More Accurate
  Co-Salient Object Detection
Discriminative Consensus Mining with A Thousand Groups for More Accurate Co-Salient Object Detection
Peng Zheng
72
0
0
15 Jan 2024
Low-Resource Vision Challenges for Foundation Models
Low-Resource Vision Challenges for Foundation Models
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
94
7
0
09 Jan 2024
VASE: Object-Centric Appearance and Shape Manipulation of Real Videos
VASE: Object-Centric Appearance and Shape Manipulation of Real Videos
E. Peruzzo
Vidit Goel
Dejia Xu
Xingqian Xu
Yi Ding
Zhangyang Wang
Humphrey Shi
N. Sebe
LM&RoVGenDiffM
122
12
0
04 Jan 2024
Video Understanding with Large Language Models: A Survey
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
222
100
0
29 Dec 2023
LISA++: An Improved Baseline for Reasoning Segmentation with Large
  Language Model
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
Senqiao Yang
Tianyuan Qu
Xin Lai
Zhuotao Tian
Bohao Peng
Shu Liu
Jiaya Jia
VLM
120
32
0
28 Dec 2023
EFHQ: Multi-purpose ExtremePose-Face-HQ dataset
EFHQ: Multi-purpose ExtremePose-Face-HQ dataset
T. Dao
D. Vu
Cuong Pham
Anh Tran
87
1
0
28 Dec 2023
IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models
IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models
Zhihao Chen
Bin Hu
Chuang Niu
Tao Chen
Yuxin Li
Hongming Shan
Ge Wang
LM&MAMLLM
66
4
0
25 Dec 2023
MetaAID 2.5: A Secure Framework for Developing Metaverse Applications
  via Large Language Models
MetaAID 2.5: A Secure Framework for Developing Metaverse Applications via Large Language Models
Hongyin Zhu
80
6
0
22 Dec 2023
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR
  Understanding
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding
Senqiao Yang
Jiaming Liu
Ray Zhang
Mingjie Pan
Zoey Guo
Xiaoqi Li
Zehui Chen
Peng Gao
Yandong Guo
Shanghang Zhang
3DV
108
71
0
21 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose
  Coarse-to-Fine Vision-Language Model
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLMMLLM
167
36
0
19 Dec 2023
Mask Grounding for Referring Image Segmentation
Mask Grounding for Referring Image Segmentation
Yong Xien Chng
Henry Zheng
Yizeng Han
Xuchong Qiu
Gao Huang
ISegObjD
141
21
0
19 Dec 2023
Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition
Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition
Tianlin Li
Yao Rong
Shiao Wang
Yuan Chen
Zhe Wu
Bowei Jiang
Yonghong Tian
Jin Tang
ViT
158
3
0
18 Dec 2023
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
Hao Shao
Yuxuan Hu
Letian Wang
Steven L. Waslander
Yu Liu
Hongsheng Li
ELM
115
138
0
12 Dec 2023
Honeybee: Locality-enhanced Projector for Multimodal LLM
Honeybee: Locality-enhanced Projector for Multimodal LLM
Junbum Cha
Wooyoung Kang
Jonghwan Mun
Byungseok Roh
MLLM
104
133
0
11 Dec 2023
InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction
  Following
InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following
Shufan Li
Harkanwar Singh
Aditya Grover
DiffM
93
10
0
11 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
81
6
0
11 Dec 2023
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
Zhen Li
Mingdeng Cao
Xintao Wang
Zhongang Qi
Ming-Ming Cheng
Ying Shan
DiffM
138
201
0
07 Dec 2023
UFineBench: Towards Text-based Person Retrieval with Ultra-fine
  Granularity
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity
Jia-li Zuo
Hanyu Zhou
Ying Nie
Feng Zhang
Tianyu Guo
Nong Sang
Yunhe Wang
Changxin Gao
133
23
0
06 Dec 2023
UPOCR: Towards Unified Pixel-Level OCR Interface
UPOCR: Towards Unified Pixel-Level OCR Interface
Dezhi Peng
Zhenhua Yang
Jiaxin Zhang
Chongyu Liu
Yongxin Shi
Kai Ding
Fengjun Guo
Lianwen Jin
127
11
0
05 Dec 2023
Training on Synthetic Data Beats Real Data in Multimodal Relation
  Extraction
Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
Zilin Du
Haoxin Li
Xu Guo
Boyang Li
91
1
0
05 Dec 2023
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video
  Grounding with Multimodal Large Language Model
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model
Guozhang Li
Xinpeng Ding
De Cheng
Jie Li
Nannan Wang
Xinbo Gao
100
1
0
05 Dec 2023
GPT4Point: A Unified Framework for Point-Language Understanding and Generation
GPT4Point: A Unified Framework for Point-Language Understanding and Generation
Zhangyang Qi
Ye Fang
Zeyi Sun
Xiaoyang Wu
Tong Wu
Jiaqi Wang
Dahua Lin
Hengshuang Zhao
MLLM
184
36
0
05 Dec 2023
StoryGPT-V: Large Language Models as Consistent Story Visualizers
StoryGPT-V: Large Language Models as Consistent Story Visualizers
Xiaoqian Shen
Mohamed Elhoseiny
VLM
202
11
0
04 Dec 2023
Effectively Fine-tune to Improve Large Multimodal Models for Radiology
  Report Generation
Effectively Fine-tune to Improve Large Multimodal Models for Radiology Report Generation
Yuzhe Lu
Sungmin Hong
Yash Shah
Panpan Xu
LM&MAMedIm
64
7
0
03 Dec 2023
LVDiffusor: Distilling Functional Rearrangement Priors from Large Models
  into Diffusor
LVDiffusor: Distilling Functional Rearrangement Priors from Large Models into Diffusor
Yiming Zeng
Mingdong Wu
Long Yang
Jiyao Zhang
Hao Ding
Hui Cheng
Hao Dong
DiffM
71
8
0
03 Dec 2023
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Andrés Villa
Juan Carlos León Alcázar
Alvaro Soto
Bernard Ghanem
MLLMVLM
85
11
0
03 Dec 2023
VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video
  Internet of Things
VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things
Yaoyao Zhong
Mengshi Qi
Rui Wang
Yuhan Qiu
Yang Zhang
Huadong Ma
73
2
0
01 Dec 2023
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion
  Models
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
Zhen Xing
Qi Dai
Zihao Zhang
Hui Zhang
Hang-Rui Hu
Zuxuan Wu
Yu-Gang Jiang
VGen
102
17
0
30 Nov 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
152
2
0
30 Nov 2023
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context
  Learning
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Chaoyi Zhang
Kevin Qinghong Lin
Zhengyuan Yang
Jianfeng Wang
Linjie Li
Chung-Ching Lin
Zicheng Liu
Lijuan Wang
VGen
109
32
0
29 Nov 2023
Rethinking Image Editing Detection in the Era of Generative AI
  Revolution
Rethinking Image Editing Detection in the Era of Generative AI Revolution
Zhihao Sun
Haipeng Fang
Xinying Zhao
Danding Wang
Juan Cao
91
10
0
29 Nov 2023
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation
  and Editing
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
Haoyu Zhao
Tianyi Lu
Jiaxi Gu
Xing Zhang
Qingping Zheng
Zuxuan Wu
Hang Xu
Yu-Gang Jiang
VGenDiffM
119
12
0
29 Nov 2023
IG Captioner: Information Gain Captioners are Strong Zero-shot
  Classifiers
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
Chenglin Yang
Siyuan Qiao
Yuan Cao
Yu Zhang
Tao Zhu
Alan Yuille
Jiahui Yu
VLM
54
3
0
27 Nov 2023
GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
Jiemin Fang
Junjie Wang
Xiaopeng Zhang
Lingxi Xie
Qi Tian
3DGSDiffM
127
117
0
27 Nov 2023
Continual Instruction Tuning for Large Multimodal Models
Continual Instruction Tuning for Large Multimodal Models
Jinghan He
Haiyun Guo
Ming Tang
Jinqiao Wang
VLMMLLMCLLKELM
85
26
0
27 Nov 2023
Previous
123...4344454647
Next