ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.12597
  4. Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
v1v2v3 (latest)

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLMMLLM
ArXiv (abs)PDFHTML

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Title
The Compressor-Retriever Architecture for Language Model OS
The Compressor-Retriever Architecture for Language Model OS
Yuan Yang
Siheng Xiong
Ehsan Shareghi
Faramarz Fekri
RALMKELM
89
1
0
02 Sep 2024
Understanding Multimodal Hallucination with Parameter-Free
  Representation Alignment
Understanding Multimodal Hallucination with Parameter-Free Representation Alignment
Yueqian Wang
Jianxin Liang
Yuxuan Wang
Huishuai Zhang
Dongyan Zhao
112
1
0
02 Sep 2024
Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive
  Content Generation
Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation
Qihua Chen
Yi Ma
Haobo Wang
Junkun Yuan
Wenzhe Zhao
Q. Tian
Hongmei Wang
Shaobo Min
Qifeng Chen
Wen Liu
DiffM
110
21
0
02 Sep 2024
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring
  Expression Segmentation
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
Yi-Chia Chen
Wei-Hua Li
Cheng Sun
Yu-Chiang Frank Wang
Chu-Song Chen
VLM
108
21
0
01 Sep 2024
StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models
StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models
Y. Guo
Faizan Siddiqui
Yang Zhao
Rama Chellappa
Shao-Yuan Lo
LRM
109
5
0
31 Aug 2024
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene
  Understanding
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
Yonghui Wang
Wengang Zhou
Hao Feng
Houqiang Li
VLM
70
1
0
30 Aug 2024
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths
  Vision Computation
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Shiwei Wu
Joya Chen
Kevin Qinghong Lin
Qimeng Wang
Yan Gao
Qianli Xu
Tong Xu
Yao Hu
Enhong Chen
Mike Zheng Shou
VLM
86
14
0
29 Aug 2024
GradBias: Unveiling Word Influence on Bias in Text-to-Image Generative
  Models
GradBias: Unveiling Word Influence on Bias in Text-to-Image Generative Models
Moreno DÍncà
E. Peruzzo
Massimiliano Mancini
Xingqian Xu
Humphrey Shi
N. Sebe
103
0
0
29 Aug 2024
DriveGenVLM: Real-world Video Generation for Vision Language Model based
  Autonomous Driving
DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving
Yongjie Fu
Anmol Jain
Xuan Di
Xu Chen
Zhaobin Mo
VGen
110
6
0
29 Aug 2024
CogVLM2: Visual Language Models for Image and Video Understanding
CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong
Weihan Wang
Ming Ding
Wenmeng Yu
Qingsong Lv
...
Debing Liu
Bin Xu
Juanzi Li
Yuxiao Dong
Jie Tang
VLMMLLM
118
121
0
29 Aug 2024
WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding
WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding
Mohan Li
Cong-Thanh Do
Simon Keizer
Youmna Farag
Svetlana Stoyanchev
R. Doddipatla
80
2
0
29 Aug 2024
Rethinking Sparse Lexical Representations for Image Retrieval in the Age
  of Rising Multi-Modal Large Language Models
Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
K. Nakata
Daisuke Miyashita
Youyang Ng
Yasuto Hoshi
J. Deguchi
66
0
0
29 Aug 2024
LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in
  Vision-Language Models
LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
Jingyi Wang
Jianzhong Ju
Jian Luan
Zhidong Deng
VLM
97
2
0
29 Aug 2024
Training-free Video Temporal Grounding using Large-scale Pre-trained
  Models
Training-free Video Temporal Grounding using Large-scale Pre-trained Models
Minghang Zheng
Xinhao Cai
Qingchao Chen
Yuxin Peng
Yang Liu
77
5
0
29 Aug 2024
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language
  Models for Trait Discovery from Biological Images
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
M. Maruf
Arka Daw
Kazi Sajeed Mehrab
Harish Babu Manogaran
Abhilash Neog
...
Wei-Lun Chao
Charles V. Stewart
T. Berger-Wolf
Wasila Dahdul
Anuj Karpatne
CoGe
91
4
0
28 Aug 2024
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Fangxun Shu
Yue Liao
Le Zhuo
Chenning Xu
Guanghao Zhang
...
Bolin Li
Zhelun Yu
Si Liu
Hongsheng Li
Hao Jiang
VLMMoE
72
18
0
28 Aug 2024
Are Pose Estimators Ready for the Open World? STAGE: Synthetic Data
  Generation Toolkit for Auditing 3D Human Pose Estimators
Are Pose Estimators Ready for the Open World? STAGE: Synthetic Data Generation Toolkit for Auditing 3D Human Pose Estimators
Nikita Kister
István Sárándi
Anna Khoreva
Gerard Pons-Moll
154
0
0
28 Aug 2024
A Survey on Evaluation of Multimodal Large Language Models
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MAELMLRM
120
26
0
28 Aug 2024
BELT-2: Bootstrapping EEG-to-Language representation alignment for
  multi-task brain decoding
BELT-2: Bootstrapping EEG-to-Language representation alignment for multi-task brain decoding
Jinzhao Zhou
Yiqun Duan
Fred Chang
T. Do
Yu-Kai Wang
Chin-Teng Lin
72
5
0
28 Aug 2024
Squid: Long Context as a New Modality for Energy-Efficient On-Device
  Language Models
Squid: Long Context as a New Modality for Energy-Efficient On-Device Language Models
Wei Chen
Zhiyuan Li
Shuo Xin
Yihao Wang
104
5
0
28 Aug 2024
AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training,
  Finetuning, and Evaluating Aerospace Embodied World Models
AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models
Fanglong Yao
Yuanchang Yue
Youzhi Liu
Xian Sun
Kun Fu
VGenEgoV
66
8
0
28 Aug 2024
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Min Shi
Fuxiao Liu
Shihao Wang
Shijia Liao
Subhashree Radhakrishnan
...
Andrew Tao
Andrew Tao
Zhiding Yu
Guilin Liu
Guilin Liu
MLLM
155
68
0
28 Aug 2024
Parameter-Efficient Quantized Mixture-of-Experts Meets Vision-Language
  Instruction Tuning for Semiconductor Electron Micrograph Analysis
Parameter-Efficient Quantized Mixture-of-Experts Meets Vision-Language Instruction Tuning for Semiconductor Electron Micrograph Analysis
Sakhinana Sagar Srinivas
Chidaksh Ravuru
Geethan Sannidhi
Venkataramana Runkana
84
0
0
27 Aug 2024
From Bias to Balance: Detecting Facial Expression Recognition Biases in
  Large Multimodal Foundation Models
From Bias to Balance: Detecting Facial Expression Recognition Biases in Large Multimodal Foundation Models
Kaylee Chhua
Zhoujinyi Wen
Vedant Hathalia
Kevin Zhu
Sean O'Brien
61
3
0
27 Aug 2024
CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View
  Synthesis
CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis
Weijia Li
Jun He
Junyan Ye
Huaping Zhong
Zhimeng Zheng
Zilong Huang
Dahua Lin
Conghui He
88
7
0
27 Aug 2024
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
Junyao Ge
Xu Zhang
Yang Zheng
Kaitai Guo
Jimin Liang
177
2
0
27 Aug 2024
A Survey of Camouflaged Object Detection and Beyond
A Survey of Camouflaged Object Detection and Beyond
Fengyang Xiao
Sujie Hu
Yuqi Shen
Chengyu Fang
Jinfa Huang
Chunming He
Longxiang Tang
Ziyun Yang
Xiu Li
105
13
0
26 Aug 2024
Revisiting Image Captioning Training Paradigm via Direct CLIP-based
  Optimization
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
Nicholas Moratelli
Davide Caffagni
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
CLIP
97
3
0
26 Aug 2024
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
Yiwei Ma
Jiayi Ji
Ke Ye
Weihuang Lin
Zhibin Wang
Yonghan Zheng
Qiang-feng Zhou
Xiaoshuai Sun
Rongrong Ji
130
11
0
26 Aug 2024
Video-CCAM: Enhancing Video-Language Understanding with Causal
  Cross-Attention Masks for Short and Long Videos
Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
Jiajun Fei
Dian Li
Zhidong Deng
Zekun Wang
Gang Liu
Hui Wang
VLM
85
43
0
26 Aug 2024
Evaluating Attribute Comprehension in Large Vision-Language Models
Evaluating Attribute Comprehension in Large Vision-Language Models
Haiwen Zhang
Zixi Yang
Yuanzhi Liu
Xinran Wang
Zheqi He
Kongming Liang
Zhanyu Ma
ELM
63
0
0
25 Aug 2024
Making Large Language Models Better Planners with Reasoning-Decision
  Alignment
Making Large Language Models Better Planners with Reasoning-Decision Alignment
Zhijian Huang
Tao Tang
Shaoxiang Chen
Sihao Lin
Zequn Jie
Lin Ma
Guangrun Wang
Xiaodan Liang
148
15
0
25 Aug 2024
FungiTastic: A multi-modal dataset and benchmark for image categorization
FungiTastic: A multi-modal dataset and benchmark for image categorization
Lukás Picek
Klara Janouskova
Milan Šulc
Jirí Matas
150
1
0
24 Aug 2024
A New Era in Computational Pathology: A Survey on Foundation and
  Vision-Language Models
A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models
Dibaloke Chanda
Milan Aryal
Nasim Yahya Soltani
Masoud Ganji
AI4CEVLM
139
7
0
23 Aug 2024
Has Multimodal Learning Delivered Universal Intelligence in Healthcare?
  A Comprehensive Survey
Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Qika Lin
Yifan Zhu
Xin Mei
Ling Huang
Jingying Ma
Kai He
Zhen Peng
Min Zhang
Mengling Feng
111
23
0
23 Aug 2024
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Yi-Fan Zhang
Huanyu Zhang
Haochen Tian
Chaoyou Fu
Shuangqing Zhang
...
Qingsong Wen
Zhang Zhang
Liwen Wang
Rong Jin
Tieniu Tan
OffRL
176
52
0
23 Aug 2024
Building and better understanding vision-language models: insights and
  future directions
Building and better understanding vision-language models: insights and future directions
Hugo Laurençon
Andrés Marafioti
Victor Sanh
Léo Tronchon
VLM
138
78
0
22 Aug 2024
FIDAVL: Fake Image Detection and Attribution using Vision-Language Model
FIDAVL: Fake Image Detection and Attribution using Vision-Language Model
Mamadou Keita
W. Hamidouche
Hessen Bougueffa Eutamene
Abdelmalik Taleb-Ahmed
Abdenour Hadid
VLM
130
1
0
22 Aug 2024
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework
  for Multimodal Large Language Model
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
Chaoya Jiang
Jia Hongrui
Haiyang Xu
Wei Ye
Mengfan Dong
Ming Yan
Ji Zhang
Fei Huang
Shikun Zhang
VLM
60
2
0
22 Aug 2024
Unlocking Attributes' Contribution to Successful Camouflage: A Combined
  Textual and VisualAnalysis Strategy
Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and VisualAnalysis Strategy
Hong Zhang
Yixuan Lyu
Qian Yu
Hanyang Liu
Huimin Ma
Ding Yuan
Yifan Yang
106
3
0
22 Aug 2024
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual
  Integration in MLLMs
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
Yuanyang Yin
Yaqi Zhao
Yajie Zhang
Ke Lin
Jiahao Wang
Xin Tao
Pengfei Wan
Di Zhang
Baoqun Yin
Wentao Zhang
LRM
111
9
0
21 Aug 2024
Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using
  Omnidirectional Camera and Multiple Vision-Language Models
Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using Omnidirectional Camera and Multiple Vision-Language Models
Kento Kawaharazuka
Yoshiki Obinata
Naoaki Kanazawa
Naoto Tsukamoto
Kei Okada
Masayuki Inaba
LM&Ro
65
0
0
21 Aug 2024
Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant
Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant
Guofeng Mei
Luigi Riz
Yiming Wang
Fabio Poiesi
ISegVLM
131
4
0
20 Aug 2024
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing
  Hallucinations in LVLMs
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
Yassine Ouali
Adrian Bulat
Brais Martínez
Georgios Tzimiropoulos
VLMMLLM
107
25
0
19 Aug 2024
C${^2}$RL: Content and Context Representation Learning for Gloss-free
  Sign Language Translation and Retrieval
C2{^2}2RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval
Zhigang Chen
Benjia Zhou
Yiqing Huang
Jun Wan
Yibo Hu
Hailin Shi
Yanyan Liang
Zhen Lei
Du Zhang
VLMSLR
70
3
0
19 Aug 2024
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous
  Driving
CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving
Hidehisa Arai
Keita Miwa
Kento Sasaki
Yu Yamaguchi
Kohei Watanabe
Shunsuke Aoki
Issei Yamamoto
107
14
0
19 Aug 2024
Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large
  Language Model Augmented Framework
Pedestrian Attribute Recognition: A New Benchmark Dataset and A Large Language Model Augmented Framework
Jiandong Jin
Xiao Wang
Qian Zhu
Haiyang Wang
Chenglong Li
VLM
70
5
0
19 Aug 2024
NAVERO: Unlocking Fine-Grained Semantics for Video-Language
  Compositionality
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality
Chaofan Tao
Gukyeong Kwon
Varad Gunjal
Hao Yang
Zhaowei Cai
Yonatan Dukler
Ashwin Swaminathan
R. Manmatha
Colin Jon Taylor
Stefano Soatto
CoGe
66
0
0
18 Aug 2024
Barbie: Text to Barbie-Style 3D Avatars
Barbie: Text to Barbie-Style 3D Avatars
Xiaokun Sun
Zhenyu Zhang
Ying Tai
Qian Wang
Hao Tang
Zili Yi
Jian Yang
LM&Ro
118
2
0
17 Aug 2024
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted
  Attack for Image-to-Text Models
Ask, Attend, Attack: A Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models
Qingyuan Zeng
Zhenzhong Wang
Yiu-ming Cheung
Min Jiang
AAML
88
2
0
16 Aug 2024
Previous
123...262728...464748
Next