ResearchTrend.AI
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Pseudo-triplet Guided Few-shot Composed Image Retrieval
Bohan Hou
Haoqiang Lin
Haokun Wen
Meng Liu
Xuemeng Song
99
5
0
08 Jul 2024
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance
Jiedong Zhuang
Jiaqi Hu
Lianrui Mu
Rui Hu
Xiaoyu Liang
Jiangnan Ye
Haoji Hu
CLIP, VLM
104
4
0
08 Jul 2024
LaSe-E2V: Towards Language-guided Semantic-Aware Event-to-Video Reconstruction
Kanghao Chen
Hangyu Li
Jiazhou Zhou
Zeyu Wang
Lin Wang
DiffM, VGen
82
2
0
08 Jul 2024
GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation
Chenxin Li
Xinyu Liu
Cheng Wang
Yifan Liu
Weihao Yu
Jing Shao
Yixuan Yuan
95
18
0
08 Jul 2024
OneDiff: A Generalist Model for Image Difference Captioning
Erdong Hu
Longteng Guo
Tongtian Yue
Zijia Zhao
Shuning Xue
Jing Liu
VLM
125
2
0
08 Jul 2024
MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning
Min Zhang
Jianye Hao
Xian Fu
Peilong Han
Hao Zhang
Lei Shi
Hongyao Tang
Yan Zheng
110
1
0
06 Jul 2024
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Yijia Xiao
Edward Sun
Tianyu Liu
Wei Wang
LRM
84
42
0
06 Jul 2024
Zero-shot Object Counting with Good Exemplars
Huilin Zhu
Jingling Yuan
Zhengwei Yang
Yu Guo
Zheng Wang
Xian Zhong
Shengfeng He
VLM
91
10
0
06 Jul 2024
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen
Yichao Du
Zichen Wen
Yiyang Zhou
Chenhang Cui
...
Jiawei Zhou
Zhuokai Zhao
Rafael Rafailov
Chelsea Finn
Huaxiu Yao
EGVM, MLLM
117
35
0
05 Jul 2024
VCoME: Verbal Video Composition with Multimodal Editing Effects
Weibo Gong
Xiaojie Jin
Xin Li
Dongliang He
Xinglong Wu
75
0
0
05 Jul 2024
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge
Yuanze Lin
Yunsheng Li
Dongdong Chen
Weijian Xu
Ronald Clark
Philip Torr
Lu Yuan
LRM, VLM
81
8
0
05 Jul 2024
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
Aditya K Surikuchi
Raquel Fernández
Sandro Pezzelle
66
6
0
05 Jul 2024
MobileFlow: A Multimodal LLM For Mobile GUI Agent
Songqin Nong
Jiali Zhu
Rui Wu
Jiongchao Jin
Shuo Shan
Xiutian Huang
Wenhao Xu
67
11
0
05 Jul 2024
Smart Vision-Language Reasoners
Denisa Roberts
Lucas Roberts
VLM, ReLM, LRM
77
4
0
05 Jul 2024
Slice-100K: A Multimodal Dataset for Extrusion-based 3D Printing
Anushrut Jignasu
Kelly O. Marshall
Ankush Kumar Mishra
Lucas Nerone Rillo
Baskar Ganapathysubramanian
Aditya Balu
Chinmay Hegde
Adarsh Krishnamurthy
62
0
0
04 Jul 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
144
6
0
04 Jul 2024
Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image Restoration
Yuhong Zhang
Hengsheng Zhang
Xinning Chai
Zhengxue Cheng
Rong Xie
Li Song
Wenjun Zhang
DiffM
76
5
0
04 Jul 2024
Precision at Scale: Domain-Specific Datasets On-Demand
Jesús M. Rodríguez-de-Vera
Imanol G. Estepa
Ignacio Sarasúa
Bhalaji Nagarajan
Petia Radeva
89
2
0
03 Jul 2024
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang
Akshay Goindani
Talha Chafekar
Leena Mathur
Haofei Yu
Ruslan Salakhutdinov
Louis-Philippe Morency
100
16
0
03 Jul 2024
ESQA: Event Sequences Question Answering
Irina Abdullaeva
Andrei Filatov
Mikhail Orlov
Ivan Karpukhin
Viacheslav Vasilev
Denis Dimitrov
Andrey Kuznetsov
Ivan A Kireev
Andrey Savchenko
86
0
0
03 Jul 2024
Multi-Task Domain Adaptation for Language Grounding with 3D Objects
Penglei Sun
Yaoxian Song
Xinglin Pan
Peijie Dong
Xiaofei Yang
Qiang-qiang Wang
Zhixu Li
Tiefeng Li
Xiaowen Chu
130
1
0
03 Jul 2024
MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis
Lei Chen
Feng Yan
Yujie Zhong
Shaoxiang Chen
Zequn Jie
Lin Ma
128
4
0
03 Jul 2024
Open Scene Graphs for Open World Object-Goal Navigation
Joel Loo
Zhanxin Wu
David Hsu
LM&Ro
94
5
0
02 Jul 2024
TokenPacker: Efficient Visual Projector for Multimodal LLM
Wentong Li
Yuqian Yuan
Jian Liu
Dongqi Tang
Song Wang
Jie Qin
Jianke Zhu
Lei Zhang
MLLM
151
67
0
02 Jul 2024
An End-to-End Speech Summarization Using Large Language Model
Hengchao Shang
Zongyao Li
Jiaxin Guo
Shaojun Li
Zhiqiang Rao
Yuanchang Luo
Daimeng Wei
Hao Yang
74
0
0
02 Jul 2024
SADL: An Effective In-Context Learning Method for Compositional Visual QA
Long Hoang Dang
T. Le
Vuong Le
Tu Minh Phuong
Truyen Tran
ReLM, CoGe
103
3
0
02 Jul 2024
GVDIFF: Grounded Text-to-Video Generation with Diffusion Models
Huanzhang Dou
Ruixiang Li
Wei Su
Xi Li
DiffM
94
1
0
02 Jul 2024
Proposal Report for the 2nd SciCAP Competition 2024
Pengpeng Li
Tingmin Li
Jingyuan Wang
Boyuan Wang
Yang Yang
52
2
0
02 Jul 2024
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs
Qiucheng Wu
Handong Zhao
Michael Stephen Saxon
T. Bui
William Yang Wang
Yang Zhang
Shiyu Chang
CoGe
93
7
0
02 Jul 2024
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities
Chenming Zhu
Tai Wang
Wenwei Zhang
Kai Chen
Xihui Liu
ReLM, LRM
112
24
0
01 Jul 2024
Tree Search for Language Model Agents
Jing Yu Koh
Stephen Marcus McAleer
Daniel Fried
Ruslan Salakhutdinov
LM&Ro, LLMAG, LRM
131
75
0
01 Jul 2024
Semantic Compositions Enhance Vision-Language Contrastive Learning
Maxwell Mbabilla Aladago
Lorenzo Torresani
Soroush Vosoughi
CoGe, VLM, CLIP
83
0
0
01 Jul 2024
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Runqi Qiao
Qiuna Tan
Guanting Dong
Minhui Wu
Chong Sun
...
Yida Xu
Muxi Diao
Zhimin Bao
Chen Li
Honggang Zhang
VLM, LRM
115
56
0
01 Jul 2024
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models
Ruinan Jin
Zikang Xu
Yuan Zhong
Qiongsong Yao
Qi Dou
S. Kevin Zhou
Xiaoxiao Li
VLM
111
17
0
01 Jul 2024
Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving
Ran Tian
Boyi Li
Xinshuo Weng
Yuxiao Chen
Edward Schmerling
Yue Wang
Boris Ivanovic
Marco Pavone
133
25
0
01 Jul 2024
InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation
Haofan Wang
Peng-Fei Xing
Renyuan Huang
Hao Ai
Qixun Wang
Xu Bai
DiffM
110
25
0
30 Jun 2024
Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang
Liping Yuan
Yuchen Zhang
110
67
0
30 Jun 2024
Hierarchical Memory for Long Video QA
Yiqin Wang
Haoji Zhang
Yansong Tang
Yong-Jin Liu
Jiashi Feng
Jifeng Dai
Xiaojie Jin
129
4
0
30 Jun 2024
GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing
Yisong Xiao
Aishan Liu
QianJia Cheng
Zhenfei Yin
Siyuan Liang
Jiapeng Li
Jing Shao
Xianglong Liu
Dacheng Tao
124
8
0
30 Jun 2024
Urban Visual Appeal According to ChatGPT: Contrasting AI and Human Insights
M. Malekzadeh
Elias S Willberg
Jussi Torkko
T. Toivonen
45
1
0
29 Jun 2024
PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration
Yuxuan Sun
Yunlong Zhang
Yixuan Si
Chenglu Zhu
Zhongyi Shui
Kai Zhang
Jingxiong Li
Xingheng Lyu
Tao Lin
Lin Yang
LM&MA, VLM, MedIm
115
12
0
28 Jun 2024
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Sukmin Yun
Haokun Lin
Rusiru Thushara
Mohammad Qazim Bhat
Yongxin Wang
...
Timothy Baldwin
Zhengzhong Liu
Eric P. Xing
Xiaodan Liang
Zhiqiang Shen
101
14
0
28 Jun 2024
GM-DF: Generalized Multi-Scenario Deepfake Detection
Yingxin Lai
Zitong Yu
Jing Yang
Bin Li
Xiangui Kang
Linlin Shen
136
11
0
28 Jun 2024
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
Jihao Liu
Xin Huang
Jinliang Zheng
Boxiao Liu
Jia Wang
Osamu Yoshie
Yu Liu
Hongsheng Li
MLLM, SyDa
63
4
0
28 Jun 2024
MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?
Jinming Li
Yichen Zhu
Zhiyuan Xu
Jindong Gu
Minjie Zhu
Xin Liu
Ning Liu
Yaxin Peng
Feifei Feng
Jian Tang
LRM, LM&Ro
105
8
0
28 Jun 2024
Multimodal Learning and Cognitive Processes in Radiology: MedGaze for Chest X-ray Scanpath Prediction
Akash Awasthi
Ngan Le
Zhigang Deng
Rishi Agrawal
Carol C. Wu
Hien Van Nguyen
MedIm
35
0
0
28 Jun 2024
PathAlign: A vision-language model for whole slide images in histopathology
Faruk Ahmed
Andrew Sellergren
Lin Yang
Shawn Xu
Boris Babenko
...
S. Shetty
Daniel Golden
Yun-Hui Liu
David F. Steiner
Ellery Wulczyn
LM&MA, VLM
109
18
0
27 Jun 2024
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Tao Zhang
Xiangtai Li
Hao Fei
Haobo Yuan
Shengqiong Wu
Shunping Ji
Chen Change Loy
Shuicheng Yan
LRM, MLLM, VLM
141
63
0
27 Jun 2024
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Junying Chen
Ruyi Ouyang
Anningzhe Gao
Shunian Chen
Guiming Hardy Chen
...
Zhenyang Cai
Ke Ji
Guangjun Yu
Xiang Wan
Benyou Wang
MedIm, LM&MA
82
50
0
27 Jun 2024
From Efficient Multimodal Models to World Models: A Survey
Xinji Mai
Zeng Tao
Junxiong Lin
Haoran Wang
Yang Chang
Yanlan Kang
Yan Wang
Wenqiang Zhang
97
6
0
27 Jun 2024