ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.12597
  4. Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLM
    MLLM
ArXivPDFHTML

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 822 papers shown
Title
Dual-View Visual Contextualization for Web Navigation
Dual-View Visual Contextualization for Web Navigation
Jihyung Kil
Chan Hee Song
Boyuan Zheng
Xiang Deng
Yu-Chuan Su
Wei-Lun Chao
EgoV
22
12
0
06 Feb 2024
Convincing Rationales for Visual Question Answering Reasoning
Convincing Rationales for Visual Question Answering Reasoning
Kun Li
G. Vosselman
Michael Ying Yang
44
1
0
06 Feb 2024
InstanceDiffusion: Instance-level Control for Image Generation
InstanceDiffusion: Instance-level Control for Image Generation
Xudong Wang
Trevor Darrell
Sai Saketh Rambhatla
Rohit Girdhar
Ishan Misra
VLM
DiffM
34
84
0
05 Feb 2024
Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual
  Text Processing
Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual Text Processing
Yan Shu
Weichao Zeng
Zhenhang Li
Fangmin Zhao
Yu Zhou
32
3
0
05 Feb 2024
PVLR: Prompt-driven Visual-Linguistic Representation Learning for
  Multi-Label Image Recognition
PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition
Hao Tan
Zichang Tan
Jun Li
Jun Wan
Zhen Lei
VLM
36
0
0
31 Jan 2024
VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large
  Models
VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models
Yi Zhao
Yilin Zhang
Rong Xiang
Jing Li
Hillming Li
43
16
0
29 Jan 2024
TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And
  Image-Prompts
TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts
Jingyu Zhuang
Di Kang
Yan-Pei Cao
Guanbin Li
Liang Lin
Ying Shan
DiffM
3DGS
50
38
0
26 Jan 2024
LanDA: Language-Guided Multi-Source Domain Adaptation
LanDA: Language-Guided Multi-Source Domain Adaptation
Zhenbin Wang
Lei Zhang
Lituan Wang
Minjuan Zhu
35
10
0
25 Jan 2024
BootPIG: Bootstrapping Zero-shot Personalized Image Generation
  Capabilities in Pretrained Diffusion Models
BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models
Senthil Purushwalkam
Akash Gokul
Chenyu You
Nikhil Naik
DiffM
39
17
0
25 Jan 2024
Towards 3D Molecule-Text Interpretation in Language Models
Towards 3D Molecule-Text Interpretation in Language Models
Sihang Li
Zhiyuan Liu
Yancheng Luo
Xiang Wang
Xiangnan He
Kenji Kawaguchi
Tat-Seng Chua
Qi Tian
AI4CE
35
42
0
25 Jan 2024
Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
Ci-Siang Lin
Chien-Yi Wang
Yu-Chiang Frank Wang
Min-Hung Chen
VLM
23
0
0
22 Jan 2024
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal
  Models for Video Question Answering
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
Haibo Wang
Chenghang Lai
Yixuan Sun
Weifeng Ge
31
5
0
19 Jan 2024
Large Language Models are Efficient Learners of Noise-Robust Speech
  Recognition
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
Yuchen Hu
Chen Chen
Chao-Han Huck Yang
Ruizhe Li
Chao Zhang
Pin-Yu Chen
Ensiong Chng
27
20
0
19 Jan 2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang
Weixin Luo
Qianyu Chen
Haonan Mai
Jindi Guo
Sixun Dong
Xiaohua Xuan
MLLM
LLMAG
52
19
0
19 Jan 2024
Temporal Insight Enhancement: Mitigating Temporal Hallucination in
  Multimodal Large Language Models
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models
Li Sun
Liuan Wang
Jun Sun
Takayuki Okatani
MLLM
19
0
0
18 Jan 2024
COCO is "ALL'' You Need for Visual Instruction Fine-tuning
COCO is "ALL'' You Need for Visual Instruction Fine-tuning
Xiaotian Han
Yiqi Wang
Bohan Zhai
Quanzeng You
Hongxia Yang
VLM
MLLM
33
2
0
17 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
69
35
0
16 Jan 2024
Towards A Better Metric for Text-to-Video Generation
Towards A Better Metric for Text-to-Video Generation
Jay Zhangjie Wu
Guian Fang
Haoning Wu
Xintao Wang
Yixiao Ge
...
Rui Zhao
Weisi Lin
Wynne Hsu
Ying Shan
Mike Zheng Shou
VGen
37
34
0
15 Jan 2024
Video Anomaly Detection and Explanation via Large Language Models
Video Anomaly Detection and Explanation via Large Language Models
Hui Lv
Qianru Sun
33
21
0
11 Jan 2024
Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding
Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding
Yuu Jinnai
Ukyo Honda
Tetsuro Morimura
Peinan Zhang
34
6
0
10 Jan 2024
Low-Resource Vision Challenges for Foundation Models
Low-Resource Vision Challenges for Foundation Models
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
30
5
0
09 Jan 2024
ChartAssisstant: A Universal Chart Multimodal Language Model via
  Chart-to-Table Pre-training and Multitask Instruction Tuning
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
Fanqing Meng
Wenqi Shao
Quanfeng Lu
Peng Gao
Kaipeng Zhang
Yu Qiao
Ping Luo
31
46
0
04 Jan 2024
GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for
  One-shot Generalizable Neural Radiance Fields
GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields
X. Pan
Zongxin Yang
Shuai Bai
Yi Yang
DiffM
OffRL
30
1
0
01 Jan 2024
LISA++: An Improved Baseline for Reasoning Segmentation with Large
  Language Model
LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model
Senqiao Yang
Tianyuan Qu
Xin Lai
Zhuotao Tian
Bohao Peng
Shu Liu
Jiaya Jia
VLM
21
28
0
28 Dec 2023
EFHQ: Multi-purpose ExtremePose-Face-HQ dataset
EFHQ: Multi-purpose ExtremePose-Face-HQ dataset
T. Dao
D. Vu
Cuong Pham
Anh Tran
29
1
0
28 Dec 2023
SSR-Encoder: Encoding Selective Subject Representation for
  Subject-Driven Generation
SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
Yuxuan Zhang
Yiren Song
Jiaming Liu
Rui Wang
Jinpeng Yu
...
Huaxia Li
Xu Tang
Yao Hu
Han Pan
Zhongliang Jing
49
58
0
26 Dec 2023
IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models
IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models
Zhihao Chen
Bin Hu
Chuang Niu
Tao Chen
Yuxin Li
Hongming Shan
Ge Wang
LM&MA
MLLM
27
4
0
25 Dec 2023
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR
  Understanding
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding
Senqiao Yang
Jiaming Liu
Ray Zhang
Mingjie Pan
Zoey Guo
Xiaoqi Li
Zehui Chen
Peng Gao
Yandong Guo
Shanghang Zhang
3DV
26
58
0
21 Dec 2023
A Strong Baseline for Temporal Video-Text Alignment
A Strong Baseline for Temporal Video-Text Alignment
Zeqian Li
Qirui Chen
Tengda Han
Ya Zhang
Yanfeng Wang
Weidi Xie
AI4TS
VGen
43
5
0
21 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose
  Coarse-to-Fine Vision-Language Model
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
48
29
0
19 Dec 2023
Mask Grounding for Referring Image Segmentation
Mask Grounding for Referring Image Segmentation
Yong Xien Chng
Henry Zheng
Yizeng Han
Xuchong Qiu
Gao Huang
ISeg
ObjD
40
15
0
19 Dec 2023
Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition
Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition
Tianlin Li
Yao Rong
Shiao Wang
Yuan Chen
Zhe Wu
Bowei Jiang
Yonghong Tian
Jin Tang
ViT
84
3
0
18 Dec 2023
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Mingsheng Li
Xin Chen
C. Zhang
Sijin Chen
Erik Cambria
Fukun Yin
Gang Yu
Tao Chen
36
24
0
17 Dec 2023
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral
  Planning States for Autonomous Driving
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
Wenhai Wang
Jiangwei Xie
ChuanYang Hu
Haoming Zou
Jianan Fan
...
Lewei Lu
Xizhou Zhu
Xiaogang Wang
Yu Qiao
Jifeng Dai
36
125
0
14 Dec 2023
Depicting Beyond Scores: Advancing Image Quality Assessment through
  Multi-modal Language Models
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
Zhiyuan You
Zheyuan Li
Jinjin Gu
Zhenfei Yin
Tianfan Xue
Chao Dong
EGVM
26
35
0
14 Dec 2023
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
Hao Shao
Yuxuan Hu
Letian Wang
Steven L. Waslander
Yu Liu
Hongsheng Li
ELM
33
113
0
12 Dec 2023
Audio-Visual LLM for Video Understanding
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
27
38
0
11 Dec 2023
Large Scale Foundation Models for Intelligent Manufacturing
  Applications: A Survey
Large Scale Foundation Models for Intelligent Manufacturing Applications: A Survey
Haotian Zhang
S. D. Semujju
Zhicheng Wang
Xianwei Lv
Kang Xu
...
Jing Wu
Zhuo Long
Wensheng Liang
Xiaoguang Ma
Ruiyan Zhuang
UQCV
AI4TS
AI4CE
29
4
0
11 Dec 2023
NLLG Quarterly arXiv Report 09/23: What are the most influential current
  AI Papers?
NLLG Quarterly arXiv Report 09/23: What are the most influential current AI Papers?
Ran Zhang
Aida Kostikova
Christoph Leiter
Jonas Belouadi
Daniil Larionov
Yanran Chen
Vivian Fresen
Steffen Eger
39
0
0
09 Dec 2023
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
Zhen Li
Mingdeng Cao
Xintao Wang
Zhongang Qi
Ming-Ming Cheng
Ying Shan
DiffM
60
189
0
07 Dec 2023
Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D
  priors
Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors
Lihe Ding
Shaocong Dong
Zhanpeng Huang
Zibin Wang
Yiyuan Zhang
Kaixiong Gong
Dan Xu
Tianfan Xue
26
17
0
07 Dec 2023
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Zeyi Sun
Ye Fang
Tong Wu
Pan Zhang
Yuhang Zang
Shu Kong
Yuanjun Xiong
Dahua Lin
Jiaqi Wang
VLM
CLIP
51
83
0
06 Dec 2023
MotionCtrl: A Unified and Flexible Motion Controller for Video
  Generation
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
Zhouxia Wang
Ziyang Yuan
Xintao Wang
Tianshui Chen
Menghan Xia
Ping Luo
Ying Shan
DiffM
VGen
47
198
0
06 Dec 2023
UFineBench: Towards Text-based Person Retrieval with Ultra-fine
  Granularity
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity
Jia-li Zuo
Hanyu Zhou
Ying Nie
Feng Zhang
Tianyu Guo
Nong Sang
Yunhe Wang
Changxin Gao
35
17
0
06 Dec 2023
Describing Differences in Image Sets with Natural Language
Describing Differences in Image Sets with Natural Language
Lisa Dunlap
Yuhui Zhang
Xiaohan Wang
Ruiqi Zhong
Trevor Darrell
Jacob Steinhardt
Joseph E. Gonzalez
Serena Yeung-Levy
CoGe
VLM
32
30
0
05 Dec 2023
Training on Synthetic Data Beats Real Data in Multimodal Relation
  Extraction
Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction
Zilin Du
Haoxin Li
Xu Guo
Boyang Li
35
1
0
05 Dec 2023
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video
  Grounding with Multimodal Large Language Model
EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model
Guozhang Li
Xinpeng Ding
De-Chun Cheng
Jie Li
Nannan Wang
Xinbo Gao
34
1
0
05 Dec 2023
Object Recognition as Next Token Prediction
Object Recognition as Next Token Prediction
Kaiyu Yue
Borchun Chen
Jonas Geiping
Hengduo Li
Tom Goldstein
Ser-Nam Lim
40
9
0
04 Dec 2023
Semantics-aware Motion Retargeting with Vision-Language Models
Semantics-aware Motion Retargeting with Vision-Language Models
Haodong Zhang
Zhike Chen
Haocheng Xu
Lei Hao
Xiaofei Wu
Songcen Xu
Zhensong Zhang
Yue Wang
Rong Xiong
38
4
0
04 Dec 2023
InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language
  Models
InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
Xunguang Wang
Zhenlan Ji
Pingchuan Ma
Zongjie Li
Shuai Wang
MLLM
43
11
0
04 Dec 2023
Previous
123...121314151617
Next