ResearchTrend.AI
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Versions: v1, v2, v3 (latest)

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,338 papers shown
Title
ZeShot-VQA: Zero-Shot Visual Question Answering Framework with Answer Mapping for Natural Disaster Damage Assessment
Ehsan Karimi
Maryam Rahnemoonfar
15
0
0
30 May 2025
Benchmarking Foundation Models for Zero-Shot Biometric Tasks
Redwan Sony
Parisa Farmanifard
Hamzeh Alzwairy
Nitish Shukla
Arun Ross
CVBM, VLM
56
0
0
30 May 2025
un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
Yinqi Li
Jiahe Zhao
Hong Chang
Ruibing Hou
Shiguang Shan
Xilin Chen
CLIP, VLM
43
0
0
30 May 2025
Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts
Xin He
Xumeng Han
Longhui Wei
Lingxi Xie
Qi Tian
MoE
41
0
0
30 May 2025
S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation
Yichen Xie
Runsheng Xu
Tong He
Jyh-Jing Hwang
Katie Luo
...
Letian Chen
Yiren Lu
Zhaoqi Leng
Dragomir Anguelov
Mingxing Tan
VLM, LRM
42
0
0
30 May 2025
DisTime: Distribution-based Time Representation for Video Large Language Models
Yingsen Zeng
Zepeng Huang
Yujie Zhong
Chengjian Feng
Jie Hu
Lin Ma
Yang Liu
VGen
25
0
0
30 May 2025
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
Wenhan Yang
Spencer Stice
Ali Payani
Baharan Mirzasoleiman
MLLM
30
0
0
30 May 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Gen Luo
Ganlin Yang
Ziyang Gong
Guanzhou Chen
Haonan Duan
...
Wenhai Wang
Jifeng Dai
Yu Qiao
Rongrong Ji
X. Zhu
LM&Ro
33
1
0
30 May 2025
When Large Multimodal Models Confront Evolving Knowledge: Challenges and Pathways
Kailin Jiang
Yuntao Du
Yukai Ding
Yuchen Ren
Ning Jiang
Zhi Gao
Zilong Zheng
Lei Liu
Bin Li
Qing Li
KELM
51
0
0
30 May 2025
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Yaxin Luo
Zhaoyi Li
Jiacheng Liu
Jiacheng Cui
Xiaohan Zhao
Zhiqiang Shen
LLMAG, LRM, VLM
29
0
0
30 May 2025
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
Yuting Zhang
Hao Lu
Qingyong Hu
Yin Wang
Kaishen Yuan
Xin Liu
Kaishun Wu
MLLM, LRM
38
0
0
30 May 2025
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model
Qingyu Shi
Jinbin Bai
Zhuoran Zhao
Wenhao Chai
Kaidong Yu
...
Shuangyong Song
Yunhai Tong
Xiangtai Li
X. Li
Shuicheng Yan
87
2
0
29 May 2025
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu
Fangfu Liu
Yi-Hsin Hung
Yueqi Duan
LRM
85
1
0
29 May 2025
Are MLMs Trapped in the Visual Room?
Yazhou Zhang
Chunwang Zou
Qimeng Liu
Lu Rong
Ben Yao
Zheng Lian
Qiuchi Li
Peng Zhang
Jing Qin
AAML
94
0
0
29 May 2025
Vid-SME: Membership Inference Attacks against Large Video Understanding Models
Qi Li
Runpeng Yu
Xinchao Wang
27
2
0
29 May 2025
Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation
Jiahao Cui
Yan Chen
Mingwang Xu
Hanlin Shang
Yuxuan Chen
Yun Zhan
Zilong Dong
Yao Yao
Jingdong Wang
Siyu Zhu
DiffM, VGen
64
0
0
29 May 2025
Stairway to Success: Zero-Shot Floor-Aware Object-Goal Navigation via LLM-Driven Coarse-to-Fine Exploration
Zeying Gong
Rong Li
Tianshuai Hu
Ronghe Qiu
Lingdong Kong
Lingfeng Zhang
Yiyi Ding
Leying Zhang
Junwei Liang
53
0
0
29 May 2025
A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis
Shengyuan Liu
Boyun Zheng
Wenting Chen
Zhihao Peng
Zhenfei Yin
Jing Shao
Jiancong Hu
Yixuan Yuan
ELM
84
0
0
29 May 2025
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
Yu Li
Jin Jiang
J. Zhu
Shuai Peng
Baole Wei
Yuxuan Zhou
Liangcai Gao
53
0
0
29 May 2025
HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image
Junyi Guo
Jingxuan Zhang
Fangyu Wu
Huanda Lu
Qiufeng Wang
Wenmian Yang
Eng Gee Lim
Dongming Lu
DiffM
17
0
0
29 May 2025
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
Chenhao Zheng
Jieyu Zhang
Mohammadreza Salehi
Ziqi Gao
Vishnu Iyengar
Norimasa Kobori
Quan Kong
Ranjay Krishna
28
0
0
29 May 2025
Multi-Sourced Compositional Generalization in Visual Question Answering
Chuanhao Li
Wenbo Ye
Zhen Li
Yuwei Wu
Yunde Jia
CoGe
63
0
0
29 May 2025
TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance
Keren Ye
Ignacio Garcia Dorado
Michalis Raptis
M. Delbracio
Irene Zhu
P. Milanfar
Hossein Talebi
31
0
0
29 May 2025
DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
Sungjune Park
Hyunjun Kim
Junho Kim
S. T. Kim
Y. Ro
LRM
123
0
0
29 May 2025
SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning
Jiaqi Huang
Zunnan Xu
Jun Zhou
Ting Liu
Yicheng Xiao
Mingwen Ou
Bowen Ji
Xiu Li
Kehong Yuan
VLM
89
0
0
28 May 2025
VidText: Towards Comprehensive Evaluation for Video Text Understanding
Zhoufaran Yang
Yan Shu
Zhifei Yang
Yan Zhang
Yu-Hong Li
K. Lu
Gangyan Zeng
Shaohui Liu
Yu Zhou
N. Sebe
CoGe
54
0
0
28 May 2025
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Ce Zhang
Kaixin Ma
Tianqing Fang
Wenhao Yu
Hongming Zhang
Zhisong Zhang
Yaqi Xie
Katia Sycara
Haitao Mi
Dong Yu
VLM
98
0
0
28 May 2025
What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?
Jinhong Ni
Chang-Bin Zhang
Qiang Zhang
Jing Zhang
MDE
57
1
0
28 May 2025
Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying
Youze Xue
Dian Li
Gang Liu
23
0
0
28 May 2025
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Kaiyu Yue
Vasu Singla
Menglin Jia
John Kirchenbauer
Rifaa Qadri
Zikui Cai
A. Bhatele
Furong Huang
Tom Goldstein
VLM
66
0
0
28 May 2025
Open-Det: An Efficient Learning Framework for Open-Ended Detection
Guiping Cao
Tao Wang
Wenjian Huang
X. Lan
Jianguo Zhang
D. Jiang
ObjD, VLM
22
0
0
27 May 2025
PIPE: Physics-Informed Position Encoding for Alignment of Satellite Images and Time Series
Haobo Li
Eunseo Jung
Zixin Chen
Zhaowei Wang
Yueya Wang
Huamin Qu
Alexis Kai Hon Lau
8
0
0
27 May 2025
RefAV: Towards Planning-Centric Scenario Mining
Cainan Davidson
Deva Ramanan
Neehar Peri
89
2
0
27 May 2025
Compositional Scene Understanding through Inverse Generative Modeling
Yanbo Wang
Justin Dauwels
Yilun Du
OCL
76
0
0
27 May 2025
Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
Xiaojun Jia
Sensen Gao
Simeng Qin
Tianyu Pang
C. Du
Yihao Huang
Xinfeng Li
Yiming Li
Bo Li
Yang Liu
AAML
44
0
0
27 May 2025
QuARI: Query Adaptive Retrieval Improvement
Eric Xing
Abby Stylianou
Robert Pless
Nathan Jacobs
VLM
27
0
0
27 May 2025
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao
Keda Tao
Can Qin
Haoxuan You
Yang Sui
Huan Wang
VLM
65
0
0
27 May 2025
Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Shurong Zheng
Fan Yang
Ming Tang
Jinqiao Wang
VLM, LRM
54
0
0
27 May 2025
Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
P. Zhang
Yifei Su
Pengyuan Wu
Dong An
Li Zhang
Zhigang Wang
Dong Wang
Yan Ding
Bin Zhao
Xuelong Li
LM&Ro
81
0
0
27 May 2025
Scan-and-Print: Patch-level Data Summarization and Augmentation for Content-aware Layout Generation in Poster Design
HsiaoYuan Hsu
Yuxin Peng
DiffM
21
0
0
27 May 2025
Text-Queried Audio Source Separation via Hierarchical Modeling
Xinlei Yin
Xiulian Peng
Xue Jiang
Zhiwei Xiong
Yan Lu
54
0
0
27 May 2025
From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance
Maximilian Dreyer
Lorenz Hufe
J. Berend
Thomas Wiegand
Sebastian Lapuschkin
Wojciech Samek
42
0
0
26 May 2025
Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)
Subba Reddy Oota
Akshett Rai Jindal
Ishani Mondal
Khushbu Pahwa
Satya Sai Srinath Namburi
Manish Shrivastava
M. Singh
Bapi S. Raju
Manish Gupta
43
1
0
26 May 2025
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
Pooneh Mousavi
Yingzhi Wang
Mirco Ravanelli
Cem Subakan
AuLLM
62
0
0
26 May 2025
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang
Changle Zhou
Jiawei Kong
Kuofeng Gao
Bin Chen
Tao Liang
Guojun Ma
Shu-Tao Xia
MLLM
115
0
0
26 May 2025
Knowledge-Aligned Counterfactual-Enhancement Diffusion Perception for Unsupervised Cross-Domain Visual Emotion Recognition
Wen Yin
Yong Wang
Guiduo Duan
Dongyang Zhang
Xin Hu
Yuan-Fang Li
Tao He
125
0
0
26 May 2025
MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval
Rong-Cheng Tu
Zhao Jin
Jingyi Liao
Xiao Luo
Yingjie Wang
Li Shen
Dacheng Tao
113
0
0
26 May 2025
Can Visual Encoder Learn to See Arrows?
Naoyuki Terashita
Yusuke Tozaki
Hideaki Omote
Congkha Nguyen
Ryosuke Nakamoto
Yuta Koreeda
Hiroaki Ozaki
14
0
0
26 May 2025
Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift
Gihoon Kim
Hyungjin Park
Taesup Kim
DiffM, VLM
197
0
0
26 May 2025
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM
Shuang Ao
Flora D. Salim
Simon Khan
LLMAG, LM&Ro
37
0
0
26 May 2025