ResearchTrend.AI
Versions: v1, v2, v3 (latest)

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Title
Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment
Wenliang Zhong
Wenyi Wu
Qi Li
Rob Barton
Boxin Du
Shioulin Sam
Karim Bouyarmane
Ismail B. Tutar
Junzhou Huang
92
3
0
05 Jun 2024
GraphAlign: Pretraining One Graph Neural Network on Multiple Graphs via Feature Alignment
Zhenyu Hou
Haozhan Li
Yukuo Cen
Jie Tang
Yuxiao Dong
95
8
0
05 Jun 2024
Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Adapter
Peng-Fei Xing
Ning Wang
Jianbo Ouyang
Zechao Li
DiffM
72
1
0
05 Jun 2024
A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
Zicheng Zhang
H. Wu
Chunyi Li
Yingjie Zhou
Wei Sun
Xiongkuo Min
Zijian Chen
Xiaohong Liu
Weisi Lin
Guangtao Zhai
EGVM
148
18
0
05 Jun 2024
Item-Language Model for Conversational Recommendation
Li Yang
Anushya Subbiah
Hardik Patel
Judith Yue Li
Yanwei Song
Reza Mirghaderi
Vikram Aggarwal
Qifan Wang
KELM
94
5
0
05 Jun 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
Alex Jinpeng Wang
Linjie Li
Yiqi Lin
Min Li
Lijuan Wang
Mike Zheng Shou
VLM
101
5
0
04 Jun 2024
V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation
Cong Wang
Kuan Tian
Jun Zhang
Yonghang Guan
Feng Luo
Fei Shen
Zhiwei Jiang
Qing Gu
Xiao Han
Wei Yang
129
45
0
04 Jun 2024
Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts
Haodong Hong
Sen Wang
Zi Huang
Qi Wu
Jiajun Liu
109
4
0
04 Jun 2024
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Masahiro Yasuda
Shunsuke Tsubaki
Keisuke Imoto
VLM
102
7
0
04 Jun 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim
Hyunjun Kim
Yeonju Kim
Yong Man Ro
MLLM
117
16
0
04 Jun 2024
Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
Inkyu Shin
Qihang Yu
Xiaohui Shen
In So Kweon
Kuk-Jin Yoon
Liang-Chieh Chen
VGen, DiffM
127
1
0
04 Jun 2024
Parrot: Multilingual Visual Instruction Tuning
Hai-Long Sun
Da-Wei Zhou
Yangfu Li
Shiyin Lu
Chao Yi
...
Zhao Xu
Weihua Luo
Kaifu Zhang
De-Chuan Zhan
Han-Jia Ye
MLLM
163
12
0
04 Jun 2024
L-MAGIC: Language Model Assisted Generation of Images with Coherence
Zhipeng Cai
Matthias Mueller
R. Birkl
Diana Wofk
Shaoyen Tseng
JunDa Cheng
Gabriela Ben-Melech Stan
Vasudev Lal
Michael Paulitsch
DiffM, MLLM
85
6
0
03 Jun 2024
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model
An-Chieh Cheng
Hongxu Yin
Yang Fu
Qiushan Guo
Ruihan Yang
Jan Kautz
Xiaolong Wang
Sifei Liu
LRM
120
75
0
03 Jun 2024
ELSA: Evaluating Localization of Social Activities in Urban Streets
Maryam Hosseini
Marco Cipriano
Sedigheh Eslami
Daniel Hodczak
Liu Liu
Andres Sevtsuk
Gerard de Melo
67
0
0
03 Jun 2024
Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation
Enhui Ma
Lijun Zhou
Tao Tang
Zhan Zhang
Dong Han
...
Peng Jia
Xianpeng Lang
Haiyang Sun
Di Lin
Kaicheng Yu
VGen
114
28
0
03 Jun 2024
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
Weichao Zhao
Hao Feng
Qi Liu
Jingqun Tang
Shubo Wei
...
Lei Liao
Yongjie Ye
Hao Liu
Houqiang Li
Can Huang
LMTD
100
24
0
03 Jun 2024
Towards Practical Single-shot Motion Synthesis
Konstantinos Roditakis
Spyridon Thermos
N. Zioulis
VGen
121
0
0
03 Jun 2024
MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4
Vahid Azizi
Fatemeh Koochaki
VLM
112
0
0
03 Jun 2024
Multimodal Deep Learning for Low-Resource Settings: A Vector Embedding Alignment Approach for Healthcare Applications
David Restrepo
Chenwei Wu
Sebastián Andrés Cajas
Luis Filipe Nakayama
Leo Anthony Celi
Diego M. Lopez
66
3
0
02 Jun 2024
Image Captioning via Dynamic Path Customization
Yiwei Ma
Jiayi Ji
Xiaoshuai Sun
Yiyi Zhou
Xiaopeng Hong
Yongjian Wu
Rongrong Ji
81
1
0
01 Jun 2024
Artemis: Towards Referential Understanding in Complex Videos
Jihao Qiu
Yuan Zhang
Xi Tang
Lingxi Xie
Tianren Ma
Pengyu Yan
David Doermann
Qixiang Ye
Yunjie Tian
VLM, VGen
90
10
0
01 Jun 2024
Query2CAD: Generating CAD models using natural language queries
Akshay Badagabettu
Sai Sravan Yarlagadda
A. Farimani
81
15
0
31 May 2024
Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations
Tiancheng Shen
Jun Hao Liew
Long Mai
Lu Qi
Jiashi Feng
Jiaya Jia
DiffM
60
2
0
31 May 2024
StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond
Pengyuan Lyu
Yulin Li
Hao Zhou
Weihong Ma
Xingyu Wan
...
Liang Wu
Chengquan Zhang
Kun Yao
Errui Ding
Jingdong Wang
76
7
0
31 May 2024
Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models
Yi Yang
Qingwen Zhang
Kei Ikemura
Nazre Batool
John Folkesson
VLM
77
2
0
31 May 2024
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Linli Yao
Lei Li
Shuhuai Ren
Lean Wang
Yuanxin Liu
Xu Sun
Lu Hou
76
34
0
31 May 2024
MeshXL: Neural Coordinate Field for Generative 3D Foundation Models
Sijin Chen
Xin Chen
Anqi Pang
Xianfang Zeng
Wei Cheng
...
C. Zhang
Jingyi Yu
Gang Yu
Bin-Bin Fu
Tao Chen
AI4CE
135
43
0
31 May 2024
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
Cheng Tan
Jingxuan Wei
Linzhuang Sun
Zhangyang Gao
Siyuan Li
Bihui Yu
Ruifeng Guo
Stan Z. Li
ReLM, LRM, 3DV
115
7
0
31 May 2024
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Shiyin Lu
Yang Li
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
Han-Jia Ye
VLM, MLLM
144
55
0
31 May 2024
InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding
Huaxiang Zhang
Yaojia Mu
Guo-Niu Zhu
Zhongxue Gan
83
2
0
31 May 2024
Joint Embeddings for Graph Instruction Tuning
Vlad Argatu
Aaron Haag
Oliver Lohse
93
0
0
31 May 2024
Information Theoretic Text-to-Image Alignment
Chao Wang
Giulio Franzese
A. Finamore
Massimo Gallo
Pietro Michiardi
176
0
0
31 May 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
...
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
Xing Sun
VLM, MLLM
185
421
0
31 May 2024
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images
Krishnakant Singh
Thanush Navaratnam
Jannik Holmer
Simone Schaub-Meyer
Stefan Roth
DiffM
99
21
0
30 May 2024
Visual Perception by Large Language Model's Weights
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
69
8
0
30 May 2024
VividDream: Generating 3D Scene with Ambient Dynamics
Yao-Chih Lee
Yi-Ting Chen
Andrew Wang
Ting-Hsuan Liao
Brandon Y. Feng
Jia-Bin Huang
VGen
82
12
0
30 May 2024
LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild
Zhiqiang Wang
Dejia Xu
Rana Muhammad Shahroz Khan
Yanbin Lin
Zhiwen Fan
Xingquan Zhu
77
4
0
30 May 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
121
18
0
30 May 2024
NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models
Kai Wu
Boyuan Jiang
Zhengkai Jiang
Qingdong He
Donghao Luo
Shengzhi Wang
Qingwen Liu
Chengjie Wang
VLM, MLLM
115
4
0
30 May 2024
RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection
Fangyi Chen
Han Zhang
Zhantao Yang
Hao Chen
Kai Hu
Marios Savvides
ObjD, VLM
89
5
0
30 May 2024
Instruction-Guided Visual Masking
Jinliang Zheng
Jianxiong Li
Si Cheng
Yinan Zheng
Jiaming Li
Jihao Liu
Yu Liu
Jingjing Liu
Xianyuan Zhan
138
7
0
30 May 2024
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng
Pan Lu
Fan Yin
Ziniu Hu
Sheng Shen
James Zou
Kai-Wei Chang
Wei Wang
SyDa, VLM, LRM
100
46
0
30 May 2024
Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases
Zian Su
Xiangzhe Xu
Ziyang Huang
Kaiyuan Zhang
Xiangyu Zhang
86
8
0
30 May 2024
Don't drop your samples! Coherence-aware training benefits Conditional diffusion
Nicolas Dufour
Victor Besnier
Vicky Kalogeiton
David Picard
DiffM
135
2
0
30 May 2024
Transfer Attack for Bad and Good: Explain and Boost Adversarial Transferability across Multimodal Large Language Models
Hao-Ran Cheng
Erjia Xiao
Jiayan Yang
Jinhao Duan
Yichi Wang
...
Qiang Zhang
Le Yang
Kaidi Xu
Jindong Gu
Renjing Xu
AAML
142
10
0
30 May 2024
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
Qianqi Yan
Xuehai He
Xiang Yue
Xin Eric Wang
LM&MA
139
12
0
30 May 2024
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Alex Fang
Wenjing Zhou
Kevin Jamieson
S. Du
104
9
0
29 May 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Ming-Yu Liu
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM, VLM
86
35
0
29 May 2024
Video Anomaly Detection in 10 Years: A Survey and Outlook
Moshira Abdalla
Sajid Javed
Muaz Al Radi
Anwaar Ulhaq
Naoufel Werghi
93
5
0
29 May 2024