ResearchTrend.AI
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLM MLLM
arXiv: 2301.12597

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,345 papers shown
Title
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Gorjan Radevski
Teodora Popordanoska
Matthew B. Blaschko
Tinne Tuytelaars
92
0
0
12 Mar 2025
Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Luozheng Qin
Zhiyu Tan
Mengping Yang
Xiaomeng Yang
Hao Li
178
0
0
12 Mar 2025
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Ali Vosoughi
Dimitra Emmanouilidou
H. Gamper
131
1
0
12 Mar 2025
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Kevin Qinghong Lin
Mike Zheng Shou
VGen
477
1
0
12 Mar 2025
AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
Arman Zharmagambetov
Chuan Guo
Ivan Evtimov
Maya Pavlova
Ruslan Salakhutdinov
Kamalika Chaudhuri
LLMAG
141
8
0
12 Mar 2025
Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection
Yucheng Suo
Fan Ma
Kaixin Shen
Linchao Zhu
Yi Yang
VLM
88
0
0
12 Mar 2025
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization
Zongshang Pang
Mayu Otani
Yuta Nakashima
128
0
0
12 Mar 2025
Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models
Julian Spravil
Sebastian Houben
Sven Behnke
VLM
197
0
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
517
2
0
12 Mar 2025
PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling
Nikolai Korber
Eduard Kromer
Andreas Siebert
S. Hauke
Daniel Mueller-Gritschneder
Björn Schuller
94
0
0
12 Mar 2025
Memory-enhanced Retrieval Augmentation for Long Video Understanding
Huaying Yuan
Zhengyang Liang
Minhao Qin
Hongjin Qian
Yan Shu
Zhicheng Dou
Ji-Rong Wen
N. Sebe
VOS RALM VLM
119
5
0
12 Mar 2025
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Rui Yang
Lin Song
Yicheng Xiao
Runhui Huang
Yixiao Ge
Ying Shan
Hengshuang Zhao
MLLM
108
3
0
12 Mar 2025
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
Ruanjun Li
Yuedong Tan
Yuanming Shi
Jiawei Shao
VLM
375
0
0
12 Mar 2025
Teaching LMMs for Image Quality Scoring and Interpreting
Zicheng Zhang
H. Wu
Ziheng Jia
Weisi Lin
Guangtao Zhai
129
2
0
12 Mar 2025
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
Haoyu Zhang
Qiaohui Chu
Meng Liu
Yunxiao Wang
Bin Wen
Fan Yang
Yan Li
Di Zhang
Yaowei Wang
Liqiang Nie
EgoV
112
5
0
12 Mar 2025
Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
Beitao Chen
Xinyu Lyu
Lianli Gao
Jingkuan Song
Jikang Cheng
195
1
0
11 Mar 2025
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
Jialv Zou
Bencheng Liao
Qian Zhang
Wenyu Liu
Xinggang Wang
Mamba MLLM
149
1
0
11 Mar 2025
Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework
Zhuo Zhi
Chen Feng
Adam Daneshmend
Mine Orlu
Andreas Demosthenous
L. Yin
Da Li
Ziquan Liu
Miguel R. D. Rodrigues
LRM
121
1
0
11 Mar 2025
Multimodal Generation of Animatable 3D Human Models with AvatarForge
Xinhang Liu
Yu-Wing Tai
Chi-Keung Tang
101
0
0
11 Mar 2025
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Muzhi Zhu
Yuzhuo Tian
Hao Chen
Chunluan Zhou
Qingpei Guo
Yongxu Liu
M. Yang
Chunhua Shen
MLLM VLM
120
1
0
11 Mar 2025
EgoBlind: Towards Egocentric Visual Assistance for the Blind
Junbin Xiao
Nanxin Huang
Hao Qiu
Zhulin Tao
Xun Yang
Richang Hong
Ming Wang
Angela Yao
EgoV VLM
130
0
0
11 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
493
3
0
11 Mar 2025
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Jiazheng Liu
Sipeng Zheng
Börje F. Karlsson
Zongqing Lu
62
0
0
10 Mar 2025
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
Xiang Liu
Zhaoxiang Liu
Huan Hu
Zezhou Chen
Kohou Wang
Ning Wang
Kai Wang
88
1
0
10 Mar 2025
Should VLMs be Pre-trained with Image Data?
Sedrick Scott Keh
Jean Mercat
S. Gadre
Kushal Arora
Igor Vasiljevic
...
Shuran Song
Russ Tedrake
Thomas Kollar
Ludwig Schmidt
Achal Dave
VLM
106
0
0
10 Mar 2025
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Yingzhe Peng
Gongrui Zhang
Miaosen Zhang
Zhiyuan You
Jie Liu
Qipeng Zhu
Kai Yang
Xingzhong Xu
Xin Geng
Xu Yang
LRM ReLM
242
88
0
10 Mar 2025
Towards Fine-Grained Video Question Answering
Wei Dai
Alan Luo
Zane Durante
Debadutta Dash
Arnold Milstein
Kevin Schulman
Ehsan Adeli
L. Fei-Fei
109
1
0
10 Mar 2025
Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion
Yongle Zhang
Yimin Liu
Qiang Wu
DiffM
90
0
0
10 Mar 2025
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Bo Jiang
Shaoyu Chen
Qian Zhang
Wenyu Liu
Xinggang Wang
OffRL LRM VLM
161
12
0
10 Mar 2025
AttenST: A Training-Free Attention-Driven Style Transfer Framework with Pre-Trained Diffusion Models
Bo Huang
Wenlun Xu
Qizhuo Han
Haodong Jing
Ying Li
DiffM
94
0
0
10 Mar 2025
Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Bardia Safaei
Faizan Siddiqui
Jiacong Xu
Vishal M. Patel
Shao-Yuan Lo
VLM
478
1
0
10 Mar 2025
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Hanyu Zhou
Gim Hee Lee
74
0
0
10 Mar 2025
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Xing Xie
Jiawei Liu
Ziyue Lin
Huijie Fan
Zhi Han
Yandong Tang
Liangqiong Qu
113
0
0
10 Mar 2025
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo
Yingying Zhang
Xiaoyu Yang
Kang Wu
Qi Zhu
Lei Liang
Jingdong Chen
Yansheng Li
153
2
0
10 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL LRM
150
3
0
10 Mar 2025
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs
Wenzhuo Xu
Zhipeng Wei
Xiongtao Sun
Deyue Zhang
Dongdong Yang
Quanchen Zou
Xinming Zhang
AAML
90
0
0
10 Mar 2025
LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
Bangyan Li
Wenxuan Huang
Yunhang Shen
Yansen Wang
Shaohui Lin
...
Ling You
Yinqi Zhang
Ke Li
Xing Sun
Yan Sun
93
2
0
10 Mar 2025
RoboDesign1M: A Large-scale Dataset for Robot Design Understanding
T. H. Le
T. H. Nguyen
Quang-Dieu Tran
Quang Minh Nguyen
Baoru Huang
Hoan Nguyen
M. Vu
Tung D. Ta
A. Nguyen
3DV
121
0
0
09 Mar 2025
SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding
Juhyeon Park
P. Y. Kim
Jiook Cha
Shinjae Yoo
Taesup Moon
95
0
0
09 Mar 2025
AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
Wenxin Ma
Xu Zhang
Qingsong Yao
Fenghe Tang
Chenxu Wu
Yongbin Li
Rui Yan
Zihang Jiang
S. Kevin Zhou
VLM
111
3
0
09 Mar 2025
Statistical Study of Sensor Data and Investigation of ML-based Calibration Algorithms for Inexpensive Sensor Modules: Experiments from Cape Point
Travis Barrett
Amit Kumar Mishra
80
1
0
09 Mar 2025
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Zhenpeng Chen
Chunwei Wang
Xiuwei Chen
Hongbin Xu
Jiawei Han
Xiandan Liang
J. N. Han
Hang Xu
Xiaodan Liang
VLM
183
2
0
09 Mar 2025
M³amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification
Mingxiang Cao
Weiying Xie
Xin Zhang
Jiaqing Zhang
Kai Jiang
Jie Lei
Yunsong Li
Mamba
150
0
0
09 Mar 2025
Treble Counterfactual VLMs: A Causal Approach to Hallucination
Li Li
Jiashu Qu
Yuxiao Zhou
Yuehan Qin
Tiankai Yang
Yue Zhao
148
2
0
08 Mar 2025
Text-to-3D Generation using Jensen-Shannon Score Distillation
Khoi Do
Binh-Son Hua
DiffM
90
0
0
08 Mar 2025
CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
Xiangyu Yin
Jiaxu Liu
Zhen Chen
Jinwei Hu
Yi Dong
Xiaowei Huang
Wenjie Ruan
AAML
93
0
0
08 Mar 2025
SplatTalk: 3D VQA with Gaussian Splatting
Anh Thai
Songyou Peng
Kyle Genova
Leonidas Guibas
Thomas Funkhouser
3DGS
147
1
0
08 Mar 2025
Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation
Yinuo Liu
Zenghui Yuan
Guiyao Tie
Jiawen Shi
Lichao Sun
Neil Zhenqiang Gong
112
2
0
08 Mar 2025
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Junyan Lin
Haoran Chen
Yue Fan
Yingqi Fan
Xin Jin
Hui Su
Jinlan Fu
Xiaoyu Shen
101
0
0
08 Mar 2025
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Seil Kang
Jinyeong Kim
Junhyeok Kim
Seong Jae Hwang
VLM
127
5
0
08 Mar 2025