BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Title
DoLLM: How Large Language Models Understanding Network Flow Data to Detect Carpet Bombing DDoS
Qingyang Li
Yihang Zhang
Zhidong Jia
Yannan Hu
Lei Zhang
Jianrong Zhang
Yongming Xu
Yong Cui
Xinggong Zhang
AI4CE
82
8
0
13 May 2024
Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis
Tianci Bi
Xiaoyi Zhang
Zhizheng Zhang
Wenxuan Xie
Cuiling Lan
Yan Lu
Nanning Zheng
VLM
79
1
0
13 May 2024
Sakuga-42M Dataset: Scaling Up Cartoon Research
Zhenglin Pan
Yu Zhu
Yuxuan Mu
83
7
0
13 May 2024
TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt
Xiangyu Wu
Qingjun Jiang
Yang Yang
Yifeng Wu
Qingguo Chen
Jianfeng Lu
VLM, VPVLM
101
8
0
11 May 2024
Decoding Emotions in Abstract Art: Cognitive Plausibility of CLIP in Recognizing Color-Emotion Associations
Hanna-Sophia Widhoelzl
Ece Takmaz
79
2
0
10 May 2024
Probing Multimodal LLMs as World Models for Driving
Shiva Sreeram
Tsun-Hsuan Wang
Alaa Maalouf
Guy Rosman
S. Karaman
Daniela Rus
91
10
0
09 May 2024
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
Jiachen Li
Xinyao Wang
Sijie Zhu
Chia-Wen Kuo
Lu Xu
Fan Chen
Jitesh Jain
Humphrey Shi
Longyin Wen
MLLM, MoE
100
33
0
09 May 2024
LangCell: Language-Cell Pre-training for Cell Identity Understanding
Suyuan Zhao
Jiahuan Zhang
Yushuai Wu
Yizhen Luo
Zaiqing Nie
VLM
142
8
0
09 May 2024
Exploring the Capabilities of Large Multimodal Models on Dense Text
Shuo Zhang
Biao Yang
Zhang Li
Zhiyin Ma
Yuliang Liu
Xiang Bai
VLM
81
11
0
09 May 2024
A Survey on Personalized Content Synthesis with Diffusion Models
Xu-Lu Zhang
Xiao Wei
Wengyu Zhang
Jinlin Wu
Jiaxin Wu
Zhen Lei
Zhaoxiang Zhang
Qing Li
EGVM
248
22
0
09 May 2024
VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context
Yunxin Li
Baotian Hu
Haoyuan Shi
Wei Wang
Longyue Wang
Min Zhang
LRM
68
16
0
08 May 2024
THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models
Prannay Kaul
Zhizhong Li
Hao Yang
Yonatan Dukler
Ashwin Swaminathan
C. Taylor
Stefano Soatto
HILM
166
18
0
08 May 2024
From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control
Yide Shentu
Philipp Wu
Aravind Rajeswaran
Pieter Abbeel
98
15
0
08 May 2024
On the Foundations of Earth and Climate Foundation Models
Xiao Xiang Zhu
Zhitong Xiong
Yi Wang
Adam J. Stewart
Konrad Heidler
Yuanyuan Wang
Zhenghang Yuan
Thomas Dujardin
Qingsong Xu
Yilei Shi
AI4Cl, AI4CE
132
25
0
07 May 2024
Language-Image Models with 3D Understanding
Jang Hyun Cho
Boris Ivanovic
Yulong Cao
Edward Schmerling
Yue Wang
...
Boyi Li
Yurong You
Philipp Krahenbuhl
Yan Wang
Marco Pavone
LRM
72
19
0
06 May 2024
CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
Zhizhao Duan
Hao Cheng
Duo Xu
Xi Wu
Xiangxie Zhang
Xi Ye
Zhen Xie
64
8
0
06 May 2024
Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
Jiacheng Cheng
Hijung Valentina Shin
Nuno Vasconcelos
Bryan C. Russell
Fabian Caba Heilbron
VLM
65
1
0
06 May 2024
Video Diffusion Models: A Survey
Andrew Melnik
Michal Ljubljanac
Cong Lu
Qi Yan
Weiming Ren
Helge J. Ritter
VGen
150
16
0
06 May 2024
Octopi: Object Property Reasoning with Large Tactile-Language Models
Samson Yu
Kelvin Lin
Anxing Xiao
Jiafei Duan
Harold Soh
LRM
106
31
0
05 May 2024
What matters when building vision-language models?
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
110
177
0
03 May 2024
Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models
Mohamad Al Al Mdfaa
Raghad Salameh
Sergey Zagoruyko
Gonzalo Ferrer
78
1
0
03 May 2024
Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
Xuelong Geng
Tianyi Xu
Kun Wei
Bingshen Mu
Hongfei Xue
...
Pengcheng Guo
Yuhang Dai
Longhao Li
Mingchen Shao
Lei Xie
82
12
0
03 May 2024
MANTIS: Interleaved Multi-Image Instruction Tuning
Dongfu Jiang
Xuan He
Huaye Zeng
Cong Wei
Max Ku
Qian Liu
Wenhu Chen
VLM, MLLM
130
125
0
02 May 2024
Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
Yifei Ming
Yixuan Li
VLM
132
8
0
02 May 2024
Multi-modal Learnable Queries for Image Aesthetics Assessment
Zhiwei Xiong
Yunfan Zhang
Zhiqi Shen
Peiran Ren
Han Yu
EGVM
70
1
0
02 May 2024
Technical Report of NICE Challenge at CVPR 2024: Caption Re-ranking Evaluation Using Ensembled CLIP and Consensus Scores
Kiyoon Jeong
Woojun Lee
Woongchan Nam
Minjeong Ma
Pilsung Kang
60
2
0
02 May 2024
FITA: Fine-grained Image-Text Aligner for Radiology Report Generation
Honglong Yang
Hui Tang
Xiaomeng Li
MedIm
80
1
0
02 May 2024
Obtaining Favorable Layouts for Multiple Object Generation
Barak Battash
Amit Rozner
Lior Wolf
Ofir Lindenbaum
DiffM
58
2
0
01 May 2024
ASAM: Boosting Segment Anything Model with Adversarial Tuning
Bo Li
Haoke Xiao
Lv Tang
107
11
0
01 May 2024
Lightplane: Highly-Scalable Components for Neural 3D Fields
Ang Cao
Justin Johnson
Andrea Vedaldi
David Novotny
99
9
0
30 Apr 2024
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Yunhao Ge
Fangyin Wei
Siddharth Gururani
Nayeon Lee
Xuan Li
Huayu Chen
CoGe, DiffM
75
17
0
30 Apr 2024
TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains
Yoonsik Kim
Moonbin Yim
Ka Yeon Song
LMTD
113
23
0
30 Apr 2024
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Aaron Courville
Nicolas Ballas
CLIP, VLM
189
23
0
30 Apr 2024
Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model
Seonhee Cho
Choonghan Kim
Jiho Lee
Chetan Chilkunda
Sujin Choi
Joo Heung Yoon
77
1
0
29 Apr 2024
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images
Huy Quang Pham
Thang Kien-Bao Nguyen
Quan Van Nguyen
Dan Quang Tran
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
97
4
0
29 Apr 2024
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Letitia Parcalabescu
Anette Frank
MLLM, CoGe, VLM
168
6
0
29 Apr 2024
Towards Incremental Learning in Large Language Models: A Critical Review
M. Jovanovic
Peter Voss
ELM, CLL, KELM
121
5
0
28 Apr 2024
Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model
Xiaolong Li
Jiawei Mo
Ying Wang
Chethan Parameshwara
Xiaohan Fei
Ashwin Swaminathan
C. Taylor
Zhuowen Tu
Paolo Favaro
Stefano Soatto
103
4
0
28 Apr 2024
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Navve Wasserman
Noam Rotstein
Roy Ganz
Ron Kimmel
DiffM
141
16
0
28 Apr 2024
Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission
Mingyu Yang
Bowen Liu
Boyang Wang
Hun-Seok Kim
DiffM
111
6
0
27 Apr 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
73
3
0
26 Apr 2024
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
Enxin Song
Wenhao Chai
Tianbo Ye
Lei Li
Xi Li
Gaoang Wang
VLM, MLLM
117
34
0
26 Apr 2024
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu
Yilin Zhao
Daquan Zhou
Zhijie Lin
See Kiong Ng
Jiashi Feng
MLLM, VLM
122
185
0
25 Apr 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM, VLM
205
644
0
25 Apr 2024
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Bohao Li
Yuying Ge
Yi Chen
Yixiao Ge
Ruimao Zhang
Ying Shan
VLM
95
60
0
25 Apr 2024
Continual Learning of Large Language Models: A Comprehensive Survey
Haizhou Shi
Zihao Xu
Hengyi Wang
Weiyi Qin
Wenyuan Wang
Yibin Wang
Zifeng Wang
Sayna Ebrahimi
Hao Wang
CLL, KELM, LRM
165
88
0
25 Apr 2024
NTIRE 2024 Quality Assessment of AI-Generated Content Challenge
Xiaohong Liu
Xiongkuo Min
Guangtao Zhai
Chunyi Li
Tengchuan Kou
...
Qi Yan
Youran Qu
Xiaohui Zeng
Lele Wang
Renjie Liao
110
31
0
25 Apr 2024
EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning
Hongxia Xie
Chu-Jun Peng
Yu-Wen Tseng
Hung-Jen Chen
Chan-Feng Hsu
Hong-Han Shuai
Wen-Huang Cheng
123
19
0
25 Apr 2024
Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data
Niclas Popp
J. H. Metzen
Matthias Hein
VLM
107
1
0
25 Apr 2024
Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
Kuofeng Gao
Jindong Gu
Yang Bai
Shu-Tao Xia
Philip Torr
Wei Liu
Zhifeng Li
132
13
0
25 Apr 2024