ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.12597
  4. Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
v1v2v3 (latest)

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLMMLLM
ArXiv (abs)PDFHTML

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Title
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Irene Huang
Wei Lin
M. Jehanzeb Mirza
Jacob A. Hansen
Sivan Doveh
...
Trevor Darrel
Chuang Gan
Aude Oliva
Rogerio Feris
Leonid Karlinsky
CoGeLRM
94
9
0
12 Jun 2024
Flash-VStream: Memory-Based Real-Time Understanding for Long Video
  Streams
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Haoji Zhang
Yiqin Wang
Yansong Tang
Yong-Jin Liu
Jiashi Feng
Jifeng Dai
Xiaojie Jin
112
45
0
12 Jun 2024
A Concept-Based Explainability Framework for Large Multimodal Models
A Concept-Based Explainability Framework for Large Multimodal Models
Jayneel Parekh
Pegah Khayatan
Mustafa Shukor
A. Newson
Matthieu Cord
102
18
0
12 Jun 2024
Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A
  Survey
Large Language Models Meet Text-Centric Multimodal Sentiment Analysis: A Survey
Hao Yang
Yanyan Zhao
Yang Wu
Shilong Wang
Tian Zheng
Hongbo Zhang
Zongyang Ma
Wanxiang Che
Bing Qin
135
14
0
12 Jun 2024
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities
  in Large Vision-Language Models
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
Shimin Chen
Yitian Yuan
Shaoxiang Chen
Zequn Jie
Lin Ma
VLM
84
4
0
12 Jun 2024
Grounding Multimodal Large Language Models in Actions
Grounding Multimodal Large Language Models in Actions
Andrew Szot
Bogdan Mazoure
Harsh Agrawal
Devon Hjelm
Z. Kira
Alexander Toshev
LM&Ro
91
14
0
12 Jun 2024
An Image is Worth 32 Tokens for Reconstruction and Generation
An Image is Worth 32 Tokens for Reconstruction and Generation
Qihang Yu
Mark Weber
XueQing Deng
Xiaohui Shen
Daniel Cremers
Liang-Chieh Chen
VLMViT
180
104
0
11 Jun 2024
Commonsense-T2I Challenge: Can Text-to-Image Generation Models
  Understand Commonsense?
Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?
Xingyu Fu
Muyu He
Yujie Lu
William Yang Wang
Dan Roth
EGVMLRM
105
21
0
11 Jun 2024
Situational Awareness Matters in 3D Vision Language Reasoning
Situational Awareness Matters in 3D Vision Language Reasoning
Yunze Man
Liang-Yan Gui
Yu-Xiong Wang
91
18
0
11 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent
  Compression Learning
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLMCLIP
87
6
0
11 Jun 2024
Image Textualization: An Automatic Framework for Creating Accurate and
  Detailed Image Descriptions
Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
Renjie Pi
Jianshu Zhang
Jipeng Zhang
Boyao Wang
Zhekai Chen
Tong Zhang
3DV
95
24
0
11 Jun 2024
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based
  Reformulation and Box-Based Segmentation
Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation
Jinyuan Li
Ziyan Li
Han Li
Jianfei Yu
Rui Xia
Di Sun
Gang Pan
73
2
0
11 Jun 2024
Needle In A Multimodal Haystack
Needle In A Multimodal Haystack
Weiyun Wang
Shuibo Zhang
Yiming Ren
Yuchen Duan
Tiantong Li
...
Ping Luo
Yu Qiao
Jifeng Dai
Wenqi Shao
Wenhai Wang
VLM
118
24
0
11 Jun 2024
Translating speech with just images
Translating speech with just images
Dan Oneaţă
Herman Kamper
VLM
43
1
0
11 Jun 2024
MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance
MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance
X. Wang
Siming Fu
Qihan Huang
Wanggui He
Hao Jiang
DiffM
148
53
0
11 Jun 2024
Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees
Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees
Sijia Chen
Yibo Wang
Yi-Feng Wu
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
Lijun Zhang
LLMAGLRM
125
18
0
11 Jun 2024
TRINS: Towards Multimodal Language Models that Can Read
TRINS: Towards Multimodal Language Models that Can Read
Ruiyi Zhang
Yanzhe Zhang
Jian Chen
Yufan Zhou
Jiuxiang Gu
Changyou Chen
Tong Sun
VLM
82
6
0
10 Jun 2024
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video
  Prediction
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
Zhen Xing
Qi Dai
Zejia Weng
Zuxuan Wu
Yu-Gang Jiang
VGen
132
14
0
10 Jun 2024
Latent Directions: A Simple Pathway to Bias Mitigation in Generative AI
Latent Directions: A Simple Pathway to Bias Mitigation in Generative AI
Carolina Lopez Olmos
A. Neophytou
Sunando Sengupta
Dim P. Papadopoulos
EGVM
61
2
0
10 Jun 2024
Robust Latent Representation Tuning for Image-text Classification
Robust Latent Representation Tuning for Image-text Classification
Hao Sun
Yu Song
VLM
125
0
0
10 Jun 2024
Synthesizing Efficient Data with Diffusion Models for Person
  Re-Identification Pre-Training
Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training
Ke Niu
Haiyang Yu
X. Qian
Teng Fu
Bin Li
Xiangyang Xue
102
5
0
10 Jun 2024
Vript: A Video Is Worth Thousands of Words
Vript: A Video Is Worth Thousands of Words
Dongjie Yang
Suyuan Huang
Chengqiang Lu
Xiaodong Han
Haoxin Zhang
Yan Gao
Yao Hu
Hai Zhao
VGen
149
31
0
10 Jun 2024
CVQA: Culturally-diverse Multilingual Visual Question Answering
  Benchmark
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
David Romero
Chenyang Lyu
Haryo Akbarianto Wibowo
Teresa Lynn
Injy Hamed
...
Oana Ignat
Joan Nwatu
Rada Mihalcea
Thamar Solorio
Alham Fikri Aji
117
43
0
10 Jun 2024
Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic
  Reasoning Task 2024
Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024
Jinwoo Ahn
Junhyeok Park
Min-Jun Kim
Kang-Hyeon Kim
So-Yeong Sohn
Yun-Ji Lee
Du-Seong Chang
Yu-Jung Heo
Eun-Sol Kim
LRM
75
0
0
10 Jun 2024
OmniControlNet: Dual-stage Integration for Conditional Image Generation
OmniControlNet: Dual-stage Integration for Conditional Image Generation
Yilin Wang
Haiyang Xu
Xiang Zhang
Zeyuan Chen
Zhizhou Sha
Zirui Wang
Zhuowen Tu
VLM
88
15
0
09 Jun 2024
SAM-PM: Enhancing Video Camouflaged Object Detection using
  Spatio-Temporal Attention
SAM-PM: Enhancing Video Camouflaged Object Detection using Spatio-Temporal Attention
Muhammad Nawfal Meeran
Gokul Adethya T
Bhanu Pratyush Mantha
87
4
0
09 Jun 2024
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances,
  and Future Directions
A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions
Daizong Liu
Yang Liu
Wencan Huang
Wei Hu
LM&Ro
118
9
0
09 Jun 2024
F-LMM: Grounding Frozen Large Multimodal Models
F-LMM: Grounding Frozen Large Multimodal Models
Size Wu
Sheng Jin
Wenwei Zhang
Lumin Xu
Wentao Liu
Wei Li
Chen Change Loy
MLLM
199
15
0
09 Jun 2024
Integrating Text and Image Pre-training for Multi-modal Algorithmic
  Reasoning
Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning
Zijian Zhang
Wei Liu
100
0
0
08 Jun 2024
InstructNav: Zero-shot System for Generic Instruction Navigation in
  Unexplored Environment
InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment
Yuxing Long
Wenzhe Cai
Hongcheng Wang
Guanqi Zhan
Hao Dong
118
34
0
07 Jun 2024
MGIMM: Multi-Granularity Instruction Multimodal Model for
  Attribute-Guided Remote Sensing Image Detailed Description
MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description
Cong Yang
Zuchao Li
Lefei Zhang
74
2
0
07 Jun 2024
LocLLM: Exploiting Generalizable Human Keypoint Localization via Large
  Language Model
LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model
Dongkai Wang
Shiyu Xuan
Shiliang Zhang
LRM
68
6
0
07 Jun 2024
LinkGPT: Teaching Large Language Models To Predict Missing Links
LinkGPT: Teaching Large Language Models To Predict Missing Links
Zhongmou He
Jing Zhu
Shengyi Qian
Joyce Chai
Danai Koutra
LRM
81
2
0
07 Jun 2024
What do MLLMs hear? Examining reasoning with text and sound components
  in Multimodal Large Language Models
What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models
Enis Berk Çoban
Michael I. Mandel
Johanna Devaney
AuLLMLRM
86
0
0
07 Jun 2024
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
Jianing Yang
Xuweiyi Chen
Nikhil Madaan
Madhavan Iyengar
Shengyi Qian
David Fouhey
Joyce Chai
3DV
165
16
0
07 Jun 2024
PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
Eduard Poesina
Adriana Valentina Costache
Adrian-Gabriel Chifu
Josiane Mothe
Radu Tudor Ionescu
VLM
151
1
0
07 Jun 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
MLLM
170
37
0
07 Jun 2024
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and
  Effective for LMMs
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
Lingchen Meng
Jianwei Yang
Rui Tian
Xiyang Dai
Zuxuan Wu
Jianfeng Gao
Yu-Gang Jiang
VLM
90
9
0
06 Jun 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better
  Captions
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Lin Chen
Xilin Wei
Jinsong Li
Xiaoyi Dong
Pan Zhang
...
Li Yuan
Yu Qiao
Dahua Lin
Feng Zhao
Jiaqi Wang
146
183
0
06 Jun 2024
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D
  Data
DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data
Qihao Liu
Yi Zhang
Song Bai
Adam Kortylewski
Alan Yuille
107
11
0
06 Jun 2024
Understanding Information Storage and Transfer in Multi-modal Large
  Language Models
Understanding Information Storage and Transfer in Multi-modal Large Language Models
Samyadeep Basu
Martin Grayson
C. Morrison
Besmira Nushi
Soheil Feizi
Daniela Massiceti
95
12
0
06 Jun 2024
Exploring the Zero-Shot Capabilities of Vision-Language Models for
  Improving Gaze Following
Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following
Anshul Gupta
Pierre Vuillecard
Arya Farkhondeh
J. Odobez
VLM
118
3
0
06 Jun 2024
POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning
  of Large Language Models
POEM: Interactive Prompt Optimization for Enhancing Multimodal Reasoning of Large Language Models
Jianben He
Xingbo Wang
Shiyi Liu
Guande Wu
Claudio Silva
Huamin Qu
LRM
64
3
0
06 Jun 2024
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with
  Multi-Modal Context and Large Language Model
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
Jinlong Xue
Yayue Deng
Yicheng Han
Yingming Gao
Ya Li
95
4
0
06 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Xu Tan
Qifeng Chen
Yu Guo
VGen
275
17
0
06 Jun 2024
Wings: Learning Multimodal LLMs without Text-only Forgetting
Wings: Learning Multimodal LLMs without Text-only Forgetting
Yi-Kai Zhang
Shiyin Lu
Yang Li
Yanqing Ma
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
De-Chuan Zhan
Han-Jia Ye
VLM
131
10
0
05 Jun 2024
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
Le Zhuo
Ruoyi Du
Han Xiao
Yangguang Li
Dongyang Liu
...
Wanli Ouyang
Ziwei Liu
Ping Luo
Hongsheng Li
Peng Gao
115
58
0
05 Jun 2024
AD-H: Autonomous Driving with Hierarchical Agents
AD-H: Autonomous Driving with Hierarchical Agents
Zaibin Zhang
Shiyu Tang
Yuanhang Zhang
Talas Fu
Yifan Wang
Yang Liu
Dong Wang
Jing Shao
Lijun Wang
H. Lu
89
4
0
05 Jun 2024
Exploiting LMM-based knowledge for image classification tasks
Exploiting LMM-based knowledge for image classification tasks
Maria Tzelepi
Vasileios Mezaris
VLM
67
3
0
05 Jun 2024
Balancing Performance and Efficiency in Zero-shot Robotic Navigation
Balancing Performance and Efficiency in Zero-shot Robotic Navigation
Dmytro Kuzmenko
N. Shvai
LM&Ro
85
0
0
05 Jun 2024
Previous
123...333435...464748
Next