BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs
Jinming Liu
Yuntao Wei
Junyan Lin
Shengyang Zhao
Heming Sun
Zhibo Chen
Wenjun Zeng
Xin Jin
137
2
0
16 Aug 2024
Beyond the Hype: A dispassionate look at vision-language models in medical scenario
Yang Nan
Huichi Zhou
Xiaodan Xing
Guang Yang
105
4
0
16 Aug 2024
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue
Manli Shu
Anas Awadalla
Jun Wang
An Yan
...
Zeyuan Chen
Silvio Savarese
Juan Carlos Niebles
Caiming Xiong
Ran Xu
VLM
108
96
0
16 Aug 2024
Visual Agents as Fast and Slow Thinkers
Guangyan Sun
Mingyu Jin
Zhenting Wang
Cheng-Long Wang
Siqi Ma
Qifan Wang
Ying Nian Wu
Dongfang Liu
LLMAG LRM
228
19
0
16 Aug 2024
VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps
Senthil Hariharan Arul
Dhruva Kumar
Vivek Sugirtharaj
Richard Kim
Xuewei Qi
R. Madhivanan
Arnie Sen
Dinesh Manocha
30
1
0
15 Aug 2024
Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval
Lifeng Zhou
Yuke Li
Rui Deng
Yuting Yang
Haoqi Zhu
74
0
0
15 Aug 2024
End-to-end Semantic-centric Video-based Multimodal Affective Computing
Ronghao Lin
Ying Zeng
Sijie Mai
Haifeng Hu
VGen
120
0
0
14 Aug 2024
MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark
Minxuan Zhou
Hao Liang
Tianpeng Li
Zhiyu Wu
Mingan Lin
...
Yujing Qiao
Weipeng Chen
Bin Cui
Wentao Zhang
Guosheng Dong
129
5
0
14 Aug 2024
Connecting Dreams with Visual Brainstorming Instruction
Yasheng Sun
Bohan Li
Mingchen Zhuge
Deng-Ping Fan
Salman Khan
Fahad Shahbaz Khan
Hideki Koike
DiffM
70
0
0
14 Aug 2024
Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces
Zhiling Chen
Hanning Chen
Mohsen Imani
Ruimin Chen
Farhad Imani
34
3
0
13 Aug 2024
Do Vision-Language Foundational models show Robust Visual Perception?
Shivam Chandhok
P. Tandon
VLM OOD
21
0
0
13 Aug 2024
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities
Shivam Chandhok
Wan-Cyuan Fan
Leonid Sigal
VLM MLLM
65
4
0
13 Aug 2024
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
Sayna Ebrahimi
Sercan O. Arik
Tejas Nama
Tomas Pfister
81
1
0
13 Aug 2024
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Junjie He
Yifeng Geng
Liefeng Bo
DiffM
119
23
0
12 Aug 2024
BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation
Hee Suk Yoon
Eunseop Yoon
Joshua Tian Jin Tee
Kang Zhang
Yu-Jung Heo
Du-Seong Chang
Chang D. Yoo
90
5
0
12 Aug 2024
Revisiting Multi-Modal LLM Evaluation
Jian Lu
Shikhar Srivastava
Junyu Chen
Robik Shrestha
Manoj Acharya
Kushal Kafle
Christopher Kanan
73
3
0
09 Aug 2024
Hyperbolic Learning with Multimodal Large Language Models
Paolo Mandica
Luca Franco
Konstantinos Kallidromitis
Suzanne Petryk
Fabio Galasso
85
3
0
09 Aug 2024
Instruction Tuning-free Visual Token Complement for Multimodal LLMs
Dongsheng Wang
Jiequan Cui
Miaoge Li
Wang Lin
Bo Chen
Hanwang Zhang
MLLM
50
4
0
09 Aug 2024
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye
Haiyang Xu
Haowei Liu
Anwen Hu
Ming Yan
Qi Qian
Ji Zhang
Fei Huang
Jingren Zhou
MLLM VLM
102
139
0
09 Aug 2024
Enhancing Journalism with AI: A Study of Contextualized Image Captioning for News Articles using LLMs and LMMs
Aliki Anagnostopoulou
Thiago S. Gouvêa
Daniel Sonntag
83
2
0
08 Aug 2024
How Well Can Vision Language Models See Image Details?
Chenhui Gou
Abdulwahab Felemban
Faizan Farooq Khan
Deyao Zhu
Jianfei Cai
Hamid Rezatofighi
Mohamed Elhoseiny
VLM MLLM
100
5
0
07 Aug 2024
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
Zilyu Ye
Yu Lei
Ruotian Peng
Jinjin Cao
Zhiyang Chen
...
Mingyuan Zhou
Xiaoqian Shen
Mohamed Elhoseiny
Nan Zhuang
Guo-Jun Qi
VGen VLM
76
1
0
07 Aug 2024
D2Styler: Advancing Arbitrary Style Transfer with Discrete Diffusion Methods
Onkar Susladkar
Gayatri Deshmukh
Sparsh Mittal
Parth Shastri
DiffM
91
3
0
07 Aug 2024
AgentsCoMerge: Large Language Model Empowered Collaborative Decision Making for Ramp Merging
Senkang Hu
Zhengru Fang
Zihan Fang
Yiqin Deng
Xianhao Chen
Yuguang Fang
Sam Kwong
162
15
0
07 Aug 2024
Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey
V. T. Truong
Luan Ba Dang
Long Bao Le
DiffM MedIm
116
19
0
06 Aug 2024
Multistain Pretraining for Slide Representation Learning in Pathology
Guillaume Jaume
Anurag J. Vaidya
Andrew Zhang
Andrew H. Song
Richard J. Chen
S. Sahai
Dandan Mo
Emilio Madrigal
L. Le
Faisal Mahmood
117
14
0
05 Aug 2024
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
Xianyu Chen
Ming Jiang
Qi Zhao
72
3
0
05 Aug 2024
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
Shishira R. Maiya
Anubhav Gupta
M. Gwilliam
Max Ehrlich
Abhinav Shrivastava
82
3
1
05 Aug 2024
Towards Coarse-grained Visual Language Navigation Task Planning Enhanced by Event Knowledge Graph
Zhao Kaichen
Song Yaoxian
Zhao Haiquan
Liu Haoyu
Li Tiefeng
Li Zhixu
81
0
0
05 Aug 2024
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Yanjie Wang
Alan Yuille
Zhuowan Li
Zilong Zheng
LRM
123
5
0
05 Aug 2024
AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning
Xin Wang
Kai-xiang Chen
Xingjun Ma
Zhineng Chen
Jingjing Chen
Yu-Gang Jiang
AAML
110
5
0
04 Aug 2024
Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models
Fushuo Huo
Wenchao Xu
Zhong Zhang
Yining Qi
Zhicheng Chen
Peilin Zhao
VLM MLLM
212
31
0
04 Aug 2024
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks
Jiaqi Wang
Hanqi Jiang
Yi-Hsueh Liu
Chong Ma
Xu-Yao Zhang
...
Xin Zhang
Wei Zhang
Dinggang Shen
Tianming Liu
Shu Zhang
VLM AI4TS
111
36
0
02 Aug 2024
The Phantom Menace: Unmasking Privacy Leakages in Vision-Language Models
Simone Caldarella
Massimiliano Mancini
Elisa Ricci
Rahaf Aljundi
PILM
78
2
0
02 Aug 2024
VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling
Qian Zhang
Xiangzi Dai
Ninghua Yang
Xiang An
Ziyong Feng
Xingyu Ren
VLM CLIP
122
22
0
02 Aug 2024
Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning
Yueen Ma
Dafeng Chi
Shiguang Wu
Yuecheng Liu
Yuzheng Zhuang
Jianye Hao
Irwin King
69
5
0
02 Aug 2024
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions
Jin Gao
Lei Gan
Yuankai Li
Yixin Ye
Dequan Wang
73
3
0
02 Aug 2024
Piculet: Specialized Models-Guided Hallucination Decrease for MultiModal Large Language Models
Afia Anjum
Xiang Liu
Zhaoxiang Liu
Ning Wang
Shiguo Lian
VLM MLLM
57
0
0
02 Aug 2024
Text-Guided Video Masked Autoencoder
D. Fan
Jue Wang
Shuai Liao
Zhikang Zhang
Vimal Bhat
Xinyu Li
VGen
57
3
0
01 Aug 2024
SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data
Yichen Lu
Álvaro Huertas-García
Xuankai Chang
Hengwei Bian
Soumi Maiti
Shinji Watanabe
93
2
0
01 Aug 2024
Are Bigger Encoders Always Better in Vision Large Models?
Bozhou Li
Hao Liang
Zimo Meng
Wentao Zhang
VLM
79
3
0
01 Aug 2024
Mitigating Multilingual Hallucination in Large Vision-Language Models
Xiaoye Qu
Mingyang Song
Xiaoye Qu
Jianfeng Dong
Yu Cheng
VLM LRM
90
2
0
01 Aug 2024
OmniParser for Pure Vision Based GUI Agent
Yadong Lu
Jianwei Yang
Yelong Shen
Ahmed Hassan Awadallah
MLLM
95
53
0
01 Aug 2024
UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation
Jiayuan Zhu
Yunli Qi
Yongqiang Chen
Nan Yin
Zhen Wang
Quanming Yao
125
11
0
01 Aug 2024
WAS: Dataset and Methods for Artistic Text Segmentation
Xudong Xie
Yuzhe Li
Yang Liu
Zhifei Zhang
Zhaowen Wang
Wei Xiong
Xiang Bai
DiffM
92
2
0
31 Jul 2024
Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs
Shiping Liu
Kecheng Zheng
Wei Chen
MLLM
114
53
0
31 Jul 2024
Learning Video Context as Interleaved Multimodal Sequences
S. Shao
Pengchuan Zhang
Y. Li
Xide Xia
A. Meso
Ziteng Gao
Jinheng Xie
N. Holliman
Mike Zheng Shou
108
6
0
31 Jul 2024
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
Ming-Kuan Wu
Xinyue Cai
Jiayi Ji
Jiale Li
Oucheng Huang
Gen Luo
Hao Fei
Xiaoshuai Sun
Rongrong Ji
MLLM
162
13
0
31 Jul 2024
PEAR: Phrase-Based Hand-Object Interaction Anticipation
Zichen Zhang
Hongcheng Luo
Wei Zhai
N. A. Ushakov
Yu Kang
99
6
0
31 Jul 2024
MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training
Rivik Setty
Chengjin Xu
Vinay Setty
Jian Guo
87
13
0
31 Jul 2024