ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.12597
  4. Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
v1v2v3 (latest)

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLMMLLM
ArXiv (abs)PDFHTML

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,345 papers shown
Title
Large Language Models for Video Surveillance Applications
Ulindu De Silva
Leon Fernando
Billy Lau Pik Lik
Zann Koh
Sam Conrad Joyce
Belinda Yuen
Chau Yuen
46
1
0
06 Jan 2025
Foundations of GenIR
Qingyao Ai
Jingtao Zhan
Yang Liu
126
0
0
06 Jan 2025
Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors
Minglin Chen
Longguang Wang
Sheng Ao
Ye Zhang
Kai Xu
Yulan Guo
DiffM
478
0
0
05 Jan 2025
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
Haicheng Wang
Zhemeng Yu
Gabriele Spadaro
Chen Ju
Victor Quétu
Enzo Tartaglione
Enzo Tartaglione
VLM
433
6
0
05 Jan 2025
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Rui Liu
Hongyu Yuan
Hong Li
128
0
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLMVLMLRM
363
59
0
03 Jan 2025
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
Lijie Tao
Han Zhang
Haizhao Jing
Yu Liu
Kelu Yao
Guoting Wei
Xizhe Xue
147
0
0
03 Jan 2025
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Ziyan Jiang
Rui Meng
Xinyi Yang
Semih Yavuz
Yingbo Zhou
Wenhu Chen
MLLMVLM
199
29
0
03 Jan 2025
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Jianjie Luo
Jingwen Chen
Yehao Li
Yingwei Pan
Jianlin Feng
Hongyang Chao
Ting Yao
DiffMVLM
139
0
0
03 Jan 2025
Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
Linhao Huang
Xue Jiang
Zhiqiang Wang
Wentao Mo
Xi Xiao
Bo Han
Yongjie Yin
Feng Zheng
AAML
158
4
0
02 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Jiaqi Wang
Hengshuang Zhao
229
16
0
02 Jan 2025
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Yuanpeng Tu
Hao Luo
Xi Chen
S. Ji
Xiang Bai
Hengshuang Zhao
DiffMVGen
162
6
0
02 Jan 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang
Hang Zhang
Xin Li
Jiashuo Sun
Yongliang Shen
Weiming Lu
Deli Zhao
Yueting Zhuang
Lidong Bing
VLM
173
2
0
01 Jan 2025
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
Junxiao Xue
Quan Deng
Fei Yu
Yanhao Wang
Jun Wang
Yongqian Li
VLM
129
5
0
31 Dec 2024
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
134
26
0
31 Dec 2024
Towards Visual Grounding: A Survey
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
284
5
0
31 Dec 2024
ChartAdapter: Large Vision-Language Model for Chart Summarization
ChartAdapter: Large Vision-Language Model for Chart Summarization
Peixin Xu
Yujuan Ding
Wenqi Fan
93
2
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILawLM&MALRM
149
30
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
190
42
0
31 Dec 2024
UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity
UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity
Jingbo Lin
Zhilu Zhang
Wenbo Li
Renjing Pei
Hang Xu
Hongzhi Zhang
Wangmeng Zuo
116
1
0
28 Dec 2024
When SAM2 Meets Video Shadow and Mirror Detection
When SAM2 Meets Video Shadow and Mirror Detection
Leiping Jie
VLM
85
1
0
26 Dec 2024
Improving Generated and Retrieved Knowledge Combination Through
  Zero-shot Generation
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du
Quanjie Han
Chao Lv
Yi Liu
Yalin Sun
Hao Shu
Hongbo Shan
Maosong Sun
RALM
146
2
0
25 Dec 2024
PRISM: Efficient Long-Range Reasoning With Short-Context LLMs
PRISM: Efficient Long-Range Reasoning With Short-Context LLMs
Dulhan Jayalath
James Bradley Wendt
Nicholas Monath
Sandeep Tata
Beliz Gunel
CLLLRM
74
0
0
25 Dec 2024
Unveiling Visual Perception in Language Models: An Attention Head
  Analysis Approach
Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
Jing Bi
Junjia Guo
Yunlong Tang
Lianggong Wen
Zhang Liu
Chenliang Xu
52
6
0
24 Dec 2024
Personalized Large Vision-Language Models
Personalized Large Vision-Language Models
Chau Pham
Hoang Phan
David Doermann
Yunjie Tian
VLM
117
4
0
23 Dec 2024
Multimodal Preference Data Synthetic Alignment with Reward Model
Multimodal Preference Data Synthetic Alignment with Reward Model
Robert Wijaya
Ngoc-Bao Nguyen
Ngai-Man Cheung
MLLMSyDa
133
4
0
23 Dec 2024
AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
Se Jin Park
Yeonju Kim
Hyeongseop Rha
Bella Godiva
Y. Ro
76
1
0
23 Dec 2024
Neural-MCRL: Neural Multimodal Contrastive Representation Learning for EEG-based Visual Decoding
Neural-MCRL: Neural Multimodal Contrastive Representation Learning for EEG-based Visual Decoding
Yueyang Li
Zijian Kang
Shengyu Gong
Wenhao Dong
Weiming Zeng
Hongjie Yan
W. Siok
Nizhuan Wang
119
2
0
23 Dec 2024
VidTwin: Video VAE with Decoupled Structure and Dynamics
VidTwin: Video VAE with Decoupled Structure and Dynamics
Yuchi Wang
Junliang Guo
Xinyi Xie
Tianyu He
Xu Sun
Li Zhao
DRLVGen
163
5
0
23 Dec 2024
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large
  Language Models
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models
Yeyuan Wang
D. Gao
Bin Li
Rujiao Long
Lei Yi
Xiaoyan Cai
Libin Yang
Jinxia Zhang
Shanqing Yu
Qi Xuan
107
1
0
22 Dec 2024
SilVar: Speech Driven Multimodal Model for Reasoning Visual Question
  Answering and Object Localization
SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization
Tan-Hanh Pham
Hoang-Nam Le
Phu-Vinh Nguyen
Chris Ngo
Truong-Son Hy
AuLLMLRM
144
1
0
21 Dec 2024
Reframing Image Difference Captioning with BLIP2IDC and Synthetic
  Augmentation
Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
Gautier Evennou
Antoine Chaffin
Vivien Chappelier
Ewa Kijak
DiffM
125
0
0
20 Dec 2024
Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
Zhiqiang Tang
Zihan Zhong
Tong He
Gerald Friedland
166
1
0
19 Dec 2024
Dataset Augmentation by Mixing Visual Concepts
Dataset Augmentation by Mixing Visual Concepts
Abdullah Al Rahat
Hemanth Venkateswara
DiffM
116
0
0
19 Dec 2024
Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic
  Segmentation
Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation
J. Zhang
Li Zhang
Shijian Li
VLM
177
0
0
18 Dec 2024
A Review of Multimodal Explainable Artificial Intelligence: Past,
  Present and Future
A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future
Shilin Sun
Wenbin An
Feng Tian
Fang Nan
Qidong Liu
Jing Liu
N. Shah
Ping Chen
157
6
0
18 Dec 2024
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal
  Large Language Models
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Cong Wei
Yujie Zhong
Haoxian Tan
Yingsen Zeng
Yong Liu
Zheng Zhao
Yujiu Yang
MLLMVLMVOS
152
3
0
18 Dec 2024
What makes a good metric? Evaluating automatic metrics for text-to-image
  consistency
What makes a good metric? Evaluating automatic metrics for text-to-image consistency
Candace Ross
Melissa Hall
Adriana Romero Soriano
Adina Williams
165
4
0
18 Dec 2024
LaMI-GO: Latent Mixture Integration for Goal-Oriented Communications
  Achieving High Spectrum Efficiency
LaMI-GO: Latent Mixture Integration for Goal-Oriented Communications Achieving High Spectrum Efficiency
Achintha Wijesinghe
Suchinthaka Wanninayaka
Weiwei Wang
Yu-Chieh Chao
Songyang Zhang
Zhi Ding
132
1
0
18 Dec 2024
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang
Shusheng Yang
Anjali W. Gupta
Rilyn Han
Li Fei-Fei
Saining Xie
LRM
212
107
0
18 Dec 2024
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Le Yang
Ziwei Zheng
Boxu Chen
Zhengyu Zhao
Chenhao Lin
Chao Shen
VLM
317
7
0
18 Dec 2024
LLMs are Also Effective Embedding Models: An In-depth Overview
LLMs are Also Effective Embedding Models: An In-depth Overview
Chongyang Tao
Tao Shen
Shen Gao
Junshuo Zhang
Zhen Li
Zhengwei Tao
Shuai Ma
143
11
0
17 Dec 2024
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Wonseok Roh
Hwanhee Jung
Jong Wook Kim
Seanie Lee
Innfarn Yoo
Andreas Lugmayr
Seunggeun Chi
K. Ramani
Sangpil Kim
3DGS
171
2
0
17 Dec 2024
OmniPrism: Learning Disentangled Visual Concept for Image Generation
OmniPrism: Learning Disentangled Visual Concept for Image Generation
Yangyang Li
Daqing Liu
Wu Liu
Allen He
Xinchen Liu
Yongdong Zhang
Guoqing Jin
DiffMCoGe
100
0
0
16 Dec 2024
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
Renqiu Xia
Mingxing Li
Hancheng Ye
Wenjie Wu
Hongbin Zhou
...
Zeang Sheng
Botian Shi
Tao Chen
Junchi Yan
Bo Zhang
196
10
0
16 Dec 2024
StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair
  Geometric Priors
StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors
Xiaokun Sun
Zeyu Cai
Zhenyu Zhang
Ying Tai
Jian Yang
133
0
0
16 Dec 2024
CLIP-SR: Collaborative Linguistic and Image Processing for Super-Resolution
CLIP-SR: Collaborative Linguistic and Image Processing for Super-Resolution
Bingwen Hu
Heng Liu
Zhedong Zheng
Ping Liu
SupR
263
0
0
16 Dec 2024
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Ruijie Lu
Yixin Chen
Junfeng Ni
Baoxiong Jia
Yu Liu
Diwen Wan
Gang Zeng
Siyuan Huang
DiffM
235
4
0
16 Dec 2024
GaussianProperty: Integrating Physical Properties to 3D Gaussians with
  LMMs
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Xinli Xu
Wenhang Ge
Dicong Qiu
ZhiFei Chen
Dongyu Yan
...
Haoyu Zhao
HanFeng Zhao
Shunsi Zhang
Junwei Liang
Ying-Cong Chen
3DGS
126
1
0
15 Dec 2024
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for
  Training-Free Zero-Shot Composed Image Retrieval
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
Yuanmin Tang
Xiaoting Qin
Jing Zhang
Jing Yu
Gaopeng Gou
Gang Xiong
Qingwei Ling
Saravan Rajmohan
Dongmei Zhang
Qi Wu
LRM
113
1
0
15 Dec 2024
Previous
123...161718...454647
Next