ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.12597
  4. Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
v1v2v3 (latest)

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLMMLLM
ArXiv (abs)PDFHTML

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,347 papers shown
Title
Beyond Pixels: Text Enhances Generalization in Real-World Image
  Restoration
Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration
Haoze Sun
Wenbo Li
Qingbin Liu
Kaiwen Zhou
Yongqiang Chen
Yong Guo
Yunshui Li
Renjing Pei
Long Peng
Yue Yang
DiffM
113
1
0
01 Dec 2024
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal
  Alignment
AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment
Yan Li
Yifei Xing
X. Lan
Xuzhao Li
Haifeng Chen
D. Jiang
Mamba
141
1
0
01 Dec 2024
Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis
Sketch-Guided Motion Diffusion for Stylized Cinemagraph Synthesis
H. Jin
Hengyuan Chang
Xiaoxuan Xie
Zhengyang Wang
Xusheng Du
Shaojun Hu
H. Xie
DiffMVGen
107
0
0
01 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
227
2
0
01 Dec 2024
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Xubing Ye
Yukang Gan
Yixiao Ge
Xiao Zhang
Yansong Tang
167
11
0
30 Nov 2024
Advancing Myopia To Holism: Fully Contrastive Language-Image
  Pre-training
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
Haicheng Wang
Chen Ju
Weixiong Lin
Shuai Xiao
Mengting Chen
...
Mingshuai Yao
Jinsong Lan
Ying Chen
Qingwen Liu
Yanfeng Wang
VLMCLIP
121
4
0
30 Nov 2024
ForgerySleuth: Empowering Multimodal Large Language Models for Image
  Manipulation Detection
ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection
Zhihao Sun
Haoran Jiang
Haoran Chen
Yixin Cao
Xipeng Qiu
Zuxuan Wu
Yu-Gang Jiang
136
2
0
29 Nov 2024
ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model
Kunyang Han
Yibo Hu
Mengxue Qu
Hailin Shi
Yao Zhao
Y. X. Wei
MLLMVLM3DV
271
1
0
29 Nov 2024
On Domain-Adaptive Post-Training for Multimodal Large Language Models
On Domain-Adaptive Post-Training for Multimodal Large Language Models
Daixuan Cheng
Shaohan Huang
Ziyu Zhu
Xintong Zhang
Wayne Xin Zhao
Zhongzhi Luan
Bo Dai
Zhenliang Zhang
VLM
165
5
0
29 Nov 2024
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation
Hui Li
Mingwang Xu
Yun Zhan
Shan Mu
Jiaye Li
...
Yukang Chen
Tan Chen
Mao Ye
Jingdong Wang
Siyu Zhu
VGen
210
7
0
28 Nov 2024
Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features
Chancharik Mitra
Brandon Huang
Tianning Chai
Zhiqiu Lin
Assaf Arbelle
Rogerio Feris
Leonid Karlinsky
Trevor Darrell
Deva Ramanan
Roei Herzig
VLM
393
4
0
28 Nov 2024
Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for
  Robust 3D Robotic Manipulation
Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation
Yueru Jia
Jiaming Liu
Sixiang Chen
Chenyang Gu
Zihan Wang
...
Lily Lee
Pengwei Wang
Zhongyuan Wang
Renrui Zhang
Shanghang Zhang
174
19
0
27 Nov 2024
HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion
  Prior
HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior
Li-Yuan Tsao
Hao-Wei Chen
Hao-Wei Chung
Deqing Sun
Chun-Yi Lee
Kelvin Chan
Ming-Hsuan Yang
DiffM
99
4
0
27 Nov 2024
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding
  with Superior Temporal Localization Ability
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
Shimin Chen
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLMMLLM
159
17
0
27 Nov 2024
When Large Vision-Language Models Meet Person Re-Identification
When Large Vision-Language Models Meet Person Re-Identification
Qizao Wang
Bin Li
Xiangyang Xue
142
5
0
27 Nov 2024
Training Data Synthesis with Difficulty Controlled Diffusion Model
Training Data Synthesis with Difficulty Controlled Diffusion Model
Zerun Wang
Jiafeng Mao
Xueting Wang
Toshihiko Yamasaki
DiffM
122
0
0
27 Nov 2024
VLM-HOI: Vision Language Models for Interpretable Human-Object
  Interaction Analysis
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
Donggoo Kang
Dasol Jeong
Hyunmin Lee
Sangwoo Park
Hasil Park
Sunkyu Kwon
Yeongjoon Kim
Joonki Paik
MLLMVLM
150
0
0
27 Nov 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Qing Jiang
Gen Luo
Yuqin Yang
Yuda Xiong
Yihao Chen
Zhaoyang Zeng
Tianhe Ren
Lei Zhang
VLMLRM
216
10
0
27 Nov 2024
Autonomous Imagination: Closed-Loop Decomposition of Visual-to-Textual Conversion in Visual Reasoning for Multimodal Large Language Models
Autonomous Imagination: Closed-Loop Decomposition of Visual-to-Textual Conversion in Visual Reasoning for Multimodal Large Language Models
Qingbin Liu
Yumeng Li
Boyuan Xiao
Yichang Jian
Ziang Qin
Tianjia Shao
Yao-Xiang Ding
Kun Zhou
LRMMLLM
206
3
0
27 Nov 2024
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
Jiaxuan Li
Junwen Mo
MinhDuc Vo
Akihiro Sugimoto
Hideki Nakayama
182
0
0
26 Nov 2024
HyperSeg: Towards Universal Visual Segmentation with Large Language
  Model
HyperSeg: Towards Universal Visual Segmentation with Large Language Model
Cong Wei
Yujie Zhong
Haoxian Tan
Yong Liu
Zheng Zhao
Jie Hu
Yujiu Yang
VOSMLLMVLMLRM
136
6
0
26 Nov 2024
CoA: Chain-of-Action for Generative Semantic Labels
CoA: Chain-of-Action for Generative Semantic Labels
Meng Wei
Zhongnian Li
Peng Ying
Xinzheng Xu
VLM
119
0
0
26 Nov 2024
InsightEdit: Towards Better Instruction Following for Image Editing
InsightEdit: Towards Better Instruction Following for Image Editing
Yingjing Xu
Jie Kong
Jiazhi Wang
Xiao Pan
Bo Lin
Qiang Liu
DiffM
128
1
0
26 Nov 2024
Efficient Multi-modal Large Language Models via Visual Token Grouping
Efficient Multi-modal Large Language Models via Visual Token Grouping
Minbin Huang
Runhui Huang
Han Shi
Yimeng Chen
Chuanyang Zheng
Xiangguo Sun
Xin Jiang
Zhiyu Li
Hong Cheng
VLM
162
4
0
26 Nov 2024
Exploring Aleatoric Uncertainty in Object Detection via Vision
  Foundation Models
Exploring Aleatoric Uncertainty in Object Detection via Vision Foundation Models
Peng Cui
Guande He
Dan Zhang
Zhijie Deng
Yinpeng Dong
Jun Zhu
177
1
0
26 Nov 2024
COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection
Jinqi Xiao
S. Sang
Tiancheng Zhi
Jing Liu
Qing Yan
Linjie Luo
Bo Yuan
Bo Yuan
VLM
210
2
0
26 Nov 2024
GenDeg: Diffusion-based Degradation Synthesis for Generalizable All-In-One Image Restoration
GenDeg: Diffusion-based Degradation Synthesis for Generalizable All-In-One Image Restoration
Sudarshan Rajagopalan
Nithin Gopalakrishnan Nair
Jay N. Paranjape
Vishal M. Patel
DiffM
169
1
0
26 Nov 2024
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
Andong Deng
Zhongpai Gao
Anwesa Choudhuri
Benjamin Planche
Meng Zheng
Bin Wang
Terrence Chen
Chong Chen
Ziyan Wu
AI4TS
135
1
0
25 Nov 2024
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual
  Knowledge
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
Yaqi Zhao
Yuanyang Yin
Lin Li
Mingan Lin
Victor Shea-Jay Huang
Siwei Chen
Xin Wu
Baoqun Yin
Guosheng Dong
Wentao Zhang
136
1
0
25 Nov 2024
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation
Jungeun Kim
Hyeongwoo Jeon
Jongseong Bae
Ha Young Kim
SLR
122
0
0
25 Nov 2024
UVCG: Leveraging Temporal Consistency for Universal Video Protection
UVCG: Leveraging Temporal Consistency for Universal Video Protection
KaiZhou Li
Jindong Gu
Xinchun Yu
Junjie Cao
Yansong Tang
Xiao-Ping Zhang
AAML
121
0
0
25 Nov 2024
SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous
  Driving with Synthetic Data from Latent Diffusion Models
SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models
Harsh Goel
Sai Shankar Narasimhan
Oguzhan Akcin
Sandeep Chinchali
DiffM
116
2
0
25 Nov 2024
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities
  through Tree-Based Image Exploration
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen
Kangjia Zhao
Tiancheng Zhao
Ruochen Xu
Zilun Zhang
Mingwei Zhu
Yuxiang Cai
149
8
0
25 Nov 2024
Generative Omnimatte: Learning to Decompose Video into Layers
Generative Omnimatte: Learning to Decompose Video into Layers
Yao-Chih Lee
Erika Lu
Sarah Rumbley
Michal Geyer
Jia-Bin Huang
Tali Dekel
Forrester Cole
DiffMVGen
203
8
0
25 Nov 2024
VideoOrion: Tokenizing Object Dynamics in Videos
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
Sipeng Zheng
Zongqing Lu
171
2
0
25 Nov 2024
Chain of Attack: On the Robustness of Vision-Language Models Against
  Transfer-Based Adversarial Attacks
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Peng Xie
Yequan Bie
Jianda Mao
Yangqiu Song
Yang Wang
Hao Chen
Kani Chen
AAML
116
1
0
24 Nov 2024
Imagine and Seek: Improving Composed Image Retrieval with an Imagined
  Proxy
Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy
Yuchen Li
Fan Ma
Yi Yang
188
3
0
24 Nov 2024
PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation
PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation
Ziyao Zeng
Jingcheng Ni
Daniel Wang
Patrick Rim
Younjoon Chung
Fengyu Yang
Byung-Woo Hong
A. Wong
DiffMMDE
289
2
0
24 Nov 2024
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Qifan Yu
Wei Chow
Zhongqi Yue
Kaihang Pan
Yang Wu
Xiaoyang Wan
Juncheng Billy Li
Siliang Tang
Hao Zhang
Yueting Zhuang
DiffM
236
29
0
24 Nov 2024
freePruner: A Training-free Approach for Large Multimodal Model
  Acceleration
freePruner: A Training-free Approach for Large Multimodal Model Acceleration
Bingxin Xu
Yuzhang Shang
Yunhao Ge
Qian Lou
Yan Yan
138
3
0
23 Nov 2024
FINECAPTION: Compositional Image Captioning Focusing on Wherever You
  Want at Any Granularity
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Hang Hua
Qing Liu
Lingzhi Zhang
Jing Shi
Zhifei Zhang
Yilin Wang
Jianming Zhang
Jiebo Luo
CoGeVLM
164
8
0
23 Nov 2024
LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces
LAGUNA: LAnguage Guided UNsupervised Adaptation with structured spaces
Anxhelo Diko
Antonino Furnari
Luigi Cinque
G. Farinella
363
0
0
23 Nov 2024
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator
Chaehun Shin
Jooyoung Choi
Heeseung Kim
Sungroh Yoon
DiffM
175
13
0
23 Nov 2024
ReWind: Understanding Long Videos with Instructed Learnable Memory
ReWind: Understanding Long Videos with Instructed Learnable Memory
Anxhelo Diko
Tinghuai Wang
Wassim Swaileh
Shiyan Sun
Ioannis Patras
KELMVLM
158
1
0
23 Nov 2024
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object
  Hallucination in Large Vision-Language Models
ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models
Junzhe Chen
Tianshu Zhang
Shijie Huang
Yuwei Niu
Linfeng Zhang
Lijie Wen
Xuming Hu
MLLMVLM
503
6
0
22 Nov 2024
AnyText2: Visual Text Generation and Editing With Customizable
  Attributes
AnyText2: Visual Text Generation and Editing With Customizable Attributes
Yuxiang Tuo
Yifeng Geng
Liefeng Bo
VLM
147
10
0
22 Nov 2024
Adversarial Prompt Distillation for Vision-Language Models
Adversarial Prompt Distillation for Vision-Language Models
Lin Luo
Xin Wang
Bojia Zi
Shihao Zhao
Xingjun Ma
Yu-Gang Jiang
AAMLVLM
182
4
0
22 Nov 2024
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual
  Token Compression
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
Yuke Zhu
Chi Xie
Shuang Liang
Bo Zheng
Sheng Guo
148
11
0
21 Nov 2024
Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided
  Visual Prompts
Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts
Honglin Li
Yuting Gao
Chenglu Zhu
Jingdong Chen
M. Yang
Lin Yang
MLLM
205
0
0
21 Nov 2024
Learning to Reason Iteratively and Parallelly for Complex Visual
  Reasoning Scenarios
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLMLRM
138
2
0
20 Nov 2024
Previous
123...181920...454647
Next