ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.12597
  4. Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
v1v2v3 (latest)

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLMMLLM
ArXiv (abs)PDFHTML

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,340 papers shown
Title
Window Token Concatenation for Efficient Visual Large Language Models
Window Token Concatenation for Efficient Visual Large Language Models
Yifan Li
Wentao Bao
Botao Ye
Zhen Tan
Tianlong Chen
Huan Liu
Yu Kong
VLM
103
0
0
05 Apr 2025
SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding
SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding
Yimin Wei
Aoran Xiao
Yexian Ren
Yuting Zhu
Hongruixuan Chen
J. Xia
Naoto Yokoya
VLM
126
0
0
04 Apr 2025
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving
Kexin Tian
Jingrui Mao
Yu Zhang
Jiwan Jiang
Yang Zhou
Zhengzhong Tu
CoGe
143
5
0
04 Apr 2025
Neutralizing the Narrative: AI-Powered Debiasing of Online News Articles
Neutralizing the Narrative: AI-Powered Debiasing of Online News Articles
Chen Wei Kuo
Kevin Chu
Nouar Aldahoul
Hazem Ibrahim
Talal Rahwan
Yasir Zaki
SyDa
156
0
0
04 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han
Shunpeng Chen
Zenghuang Fu
Zhe Feng
Lue Fan
...
Li Guo
Weiliang Meng
Xiaopeng Zhang
Rongtao Xu
Shibiao Xu
124
4
0
03 Apr 2025
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning
Xianwei Zhuang
Yuxin Xie
Yufan Deng
Dongchao Yang
Liming Liang
Jinghan Ru
Yuguo Yin
Yuexian Zou
160
5
0
03 Apr 2025
Marine Saliency Segmenter: Object-Focused Conditional Diffusion with Region-Level Semantic Knowledge Distillation
Marine Saliency Segmenter: Object-Focused Conditional Diffusion with Region-Level Semantic Knowledge Distillation
Laibin Chang
Yunke Wang
JiaXing Huang
Longxiang Deng
Bo Du
Chang Xu
DiffM
130
0
0
03 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian Guan
Wei Wu
Rui Yan
VLM
209
3
0
03 Apr 2025
DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data
DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data
Junjie Wu
Jiangtao Xie
Zhaolin Zhang
Qilong Wang
Q. Hu
P. Li
Sen Xu
VLM
92
0
0
02 Apr 2025
On Data Synthesis and Post-training for Visual Abstract Reasoning
On Data Synthesis and Post-training for Visual Abstract Reasoning
Ke Zhu
Y. Wang
Jiangjiang Liu
Qunyi Xie
Shanshan Liu
Gang Zhang
SyDaLRM
92
0
0
02 Apr 2025
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
99
1
0
02 Apr 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLMAuLLM
120
2
0
02 Apr 2025
WorldPrompter: Traversable Text-to-Scene Generation
WorldPrompter: Traversable Text-to-Scene Generation
Zhaoyang Zhang
Yannick Hold-Geoffroy
Jian Yang
Chen Ziwen
Fujun Luan
Julie Dorsey
Yiwei Hu
VGen
125
0
0
02 Apr 2025
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
A. Fragomeni
Dima Damen
Michael Wray
110
0
0
02 Apr 2025
Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
Junlong Ren
Hao Wang
125
0
0
02 Apr 2025
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization
Chaohu Liu
Tianyi Gui
Yu Liu
Linli Xu
VLMAAML
126
1
0
02 Apr 2025
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Yuejiao Su
Yi Wang
Qiongyang Hu
Chuang Yang
Lap-Pui Chau
95
0
0
02 Apr 2025
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Haochen Wang
Yucheng Zhao
Tiancai Wang
Haoqiang Fan
Xinming Zhang
Zhaoxiang Zhang
155
4
0
02 Apr 2025
Prompting Forgetting: Unlearning in GANs via Textual Guidance
Prompting Forgetting: Unlearning in GANs via Textual Guidance
Piyush Nagasubramaniam
Neeraj Karamchandani
Chen Wu
Sencun Zhu
DiffMAILawMU
87
0
0
01 Apr 2025
Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Ting Liu
Siyuan Li
97
0
0
01 Apr 2025
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation
Lanyun Zhu
Tianrun Chen
Qianxiong Xu
Xuanyi Liu
Deyi Ji
Haiyang Wu
De Wen Soh
Jing Liu
VLMLRM
86
1
0
01 Apr 2025
On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Bosung Kim
Kyuhwan Lee
Isu Jeong
Jungmin Cheon
Yeojin Lee
Seulki Lee
VGen
102
0
0
31 Mar 2025
Style Quantization for Data-Efficient GAN Training
Style Quantization for Data-Efficient GAN Training
Jian Wang
Xin Lan
Jizhe Zhou
Yuxin Tian
Jiancheng Lv
94
0
0
31 Mar 2025
Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning
Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning
Yubo Zhang
Pedro Botelho
Trevor Gordon
Gil Zussman
I. Kadota
84
0
0
31 Mar 2025
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Francesco P. Ramunno
Paolo Massa
Vitaliy Kinakh
Brandon Panos
A. Csillaghy
Slava Voloshynovskiy
DiffM
107
0
0
31 Mar 2025
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
Kai Huang
Hao Zou
Bochen Wang
Ye Xi
Zhen Xie
Hao Wang
VLM
93
0
0
31 Mar 2025
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
Yijie Zheng
Bangjun Xiao
Lei Shi
Xiaoyang Li
Faming Wu
Tianyu Li
Xuefeng Xiao
Yanzhe Zhang
Yansen Wang
Shouda Liu
MLLMMoE
141
1
0
31 Mar 2025
JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation
JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation
Fangda Chen
Shanshan Zhao
Chuanfu Xu
Long Lan
VGen
91
2
0
31 Mar 2025
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning
Zhenyang Liu
Yikai Wang
Sixiao Zheng
Tongying Pan
Longfei Liang
Yanwei Fu
Xiangyang Xue
LRM
87
0
0
30 Mar 2025
Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments
Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments
Yifan Xu
V. Kamat
Carol Menassa
111
0
0
29 Mar 2025
Shape and Texture Recognition in Large Vision-Language Models
Shape and Texture Recognition in Large Vision-Language Models
Sagi Eppel
Mor Bismut
Alona Faktor
3DVVLM
97
2
0
29 Mar 2025
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Ziyue Huang
Hongxi Yan
Qiqi Zhan
Shuai Yang
Mingming Zhang
Yiming Lei
Chenkai Zhang
Zeming Liu
Qingjie Liu
Yansen Wang
147
2
0
28 Mar 2025
Understanding Co-speech Gestures in-the-wild
Understanding Co-speech Gestures in-the-wild
Sindhu B. Hegde
KR Prajwal
Taein Kwon
Andrew Zisserman
SLR
140
0
0
28 Mar 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
J. Huang
Baoxiong Jia
Yansen Wang
Ziyu Zhu
Xiongkun Linghu
Qing Li
Song-Chun Zhu
Siyuan Huang
175
5
0
28 Mar 2025
Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Xiaomin Yu
Pengxiang Ding
Donglin Wang
Siteng Huang
Songyang Gao
Chengwei Qin
Kejian Wu
Zhaoxin Fan
Ziyue Qiao
Donglin Wang
MLLMSyDa
107
1
0
28 Mar 2025
Learning to Instruct for Visual Instruction Tuning
Learning to Instruct for Visual Instruction Tuning
Zhihan Zhou
Feng Hong
Jiaan Luo
Jiangchao Yao
Dongsheng Li
Bo Han
Yize Zhang
Yanfeng Wang
VLM
114
1
0
28 Mar 2025
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning
Dailan He
Xinyu Wang
Shulun Wang
Guanglu Song
Bingqi Ma
Hao Shao
Y. Liu
Hongsheng Li
DiffM
110
0
0
28 Mar 2025
LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models
LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models
Hengyuan Zhao
Ziqin Wang
Qixin Sun
Kaiyou Song
Yilin Li
Xiaolin Hu
Qingpei Guo
Si Liu
KELMCLLMoE
152
1
0
27 Mar 2025
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
458
0
0
27 Mar 2025
Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing
Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing
Fan Qi
Yu Duan
Changsheng Xu
DiffM
89
0
0
27 Mar 2025
A Unified Image-Dense Annotation Generation Model for Underwater Scenes
A Unified Image-Dense Annotation Generation Model for Underwater Scenes
Hongkai Lin
Dingkang Liang
Zhenghao Qi
X. Bai
DiffM
82
0
0
27 Mar 2025
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Xiaoqin Wang
Xusen Ma
Xianxu Hou
Meidan Ding
Yudong Li
Junliang Chen
Wenting Chen
Xiaoyang Peng
LinLin Shen
CVBM
131
0
0
27 Mar 2025
On Large Multimodal Models as Open-World Image Classifiers
On Large Multimodal Models as Open-World Image Classifiers
Alessandro Conti
Massimiliano Mancini
Enrico Fini
Yiming Wang
Paolo Rota
Elisa Ricci
VLM
Presented at ResearchTrend Connect | VLM on 07 May 2025
197
1
0
27 Mar 2025
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
Yue Li
Meng Tian
Zhenyu Lin
Jiangtong Zhu
Dechang Zhu
Haiqiang Liu
Zining Wang
Yueyi Zhang
Zhiwei Xiong
Xinhai Zhao
CoGeVLM
144
1
0
27 Mar 2025
Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering
Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering
Erika Mori
Yue Qiu
Hirokatsu Kataoka
Y. Aoki
71
0
0
27 Mar 2025
FakeReasoning: Towards Generalizable Forgery Detection and Reasoning
FakeReasoning: Towards Generalizable Forgery Detection and Reasoning
Y. Gao
Dongliang Chang
Bingyao Yu
Haotian Qin
Lei Chen
Kongming Liang
Zhanyu Ma
105
1
0
27 Mar 2025
FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval
FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval
Zixu Li
Zhiheng Fu
Yupeng Hu
Zhiwei Chen
Haokun Wen
Liqiang Nie
123
1
0
27 Mar 2025
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Abdelrahman M. Shaker
Muhammad Maaz
Chenhui Gou
Hamid Rezatofighi
Salman Khan
Fahad Shahbaz Khan
428
0
0
27 Mar 2025
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
Dongchen Lu
Yuyao Sun
Zilu Zhang
Leping Huang
Jianliang Zeng
Mao Shu
Huo Cao
140
4
0
27 Mar 2025
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation
Reza Qorbani
Gianluca Villani
Theodoros Panagiotakopoulos
Marc Botet Colomer
Linus Harenstam-Nielsen
...
Pier Luigi Dovesi
Jussi Karlgren
Daniel Cremers
F. Tombari
Matteo Poggi
VLM
101
0
0
27 Mar 2025
Previous
123...8910...454647
Next