BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,345 papers shown
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
De-An Huang
Subhashree Radhakrishnan
Zhiding Yu
Jan Kautz
VGen, VLM
212
0
0
24 Apr 2025
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
Phillip Y. Lee
Jihyeon Je
Chanho Park
Mikaela Angelina Uy
Leonidas Guibas
Minhyuk Sung
LRM
113
3
0
24 Apr 2025
Token Sequence Compression for Efficient Multimodal Computing
Yasmine Omri
Parth Shroff
Thierry Tambe
100
1
0
24 Apr 2025
A Survey of Foundation Model-Powered Recommender Systems: From Feature-Based, Generative to Agentic Paradigms
Chengkai Huang
Hongtao Huang
Tong Yu
Kaige Xie
Junda Wu
Shuai Zhang
Julian McAuley
Dietmar Jannach
Lina Yao
LRM, AI4CE
86
1
0
23 Apr 2025
Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes
Joan Perez
Giovanni Fusco
57
1
0
23 Apr 2025
FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing
Hariseetharam Gunduboina
Muhammad Haris Khan
Biplab Banerjee
VLM
94
0
0
23 Apr 2025
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Chris
Yichen Wei
Yi Peng
Xiang Wang
Weijie Qiu
...
Jianhao Zhang
Y. Hao
Xuchen Song
Yang Liu
Yahui Zhou
OffRL, AI4TS, SyDa, LRM, VLM
154
9
0
23 Apr 2025
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Hanlei Zhang
Zhuohang Li
Yeshuang Zhu
Hua Xu
Peiwu Wang
Haige Zhu
Jie Zhou
Jinchao Zhang
132
0
0
23 Apr 2025
WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents
Siyu Zhou
Tianyi Zhou
Yijun Yang
Guodong Long
Deheng Ye
Jing Jiang
Chengqi Zhang
LM&Ro
80
1
0
22 Apr 2025
FaceInsight: A Multimodal Large Language Model for Face Perception
Jingzhi Li
Changjiang Luo
Ruoyu Chen
Hua Zhang
Wenqi Ren
Jianhou Gan
Xiaochun Cao
CVBM, LRM
138
0
0
22 Apr 2025
Multimodal Perception for Goal-oriented Navigation: A Survey
I-Tak Ieong
Hao Tang
LM&Ro, LRM
102
0
0
22 Apr 2025
ForesightNav: Learning Scene Imagination for Efficient Exploration
Hardik Shah
Jiaxu Xing
Nico Messikommer
Boyang Sun
Marc Pollefeys
Davide Scaramuzza
224
1
0
22 Apr 2025
TrustGeoGen: Scalable and Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
Daocheng Fu
Zijun Chen
Renqiu Xia
Qi Liu
Yuan Feng
...
Peng Gao
Junchi Yan
Botian Shi
Bo Zhang
Yu Qiao
96
3
0
22 Apr 2025
Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions
Chang Zong
Bin Li
Shoujun Zhou
Jian Wan
Lei Zhang
470
0
0
22 Apr 2025
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Ziqi Pang
Yu-Xiong Wang
VLM
110
1
0
22 Apr 2025
AffordanceSAM: Segment Anything Once More in Affordance Grounding
Dengyang Jiang
Mengmeng Wang
Teli Ma
Haoyang Li
Yang Liu
Guang Dai
Lefei Zhang
91
0
0
22 Apr 2025
Vidi: Large Multimodal Models for Video Understanding and Editing
Vidi Team
Celong Liu
Chia-Wen Kuo
Dawei Du
Fan Chen
...
Wen Zhong
Xiaohui Shen
Xin Gu
Xing Mei
Xueqiong Qu
106
0
0
22 Apr 2025
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models
Siyuan Liang
Jiayang Liu
Jiecheng Zhai
Tianmeng Fang
Rongcheng Tu
A. Liu
Xiaochun Cao
Dacheng Tao
VGen
101
2
0
22 Apr 2025
Towards Understanding Camera Motions in Any Video
Zhiqiu Lin
Siyuan Cen
Daniel Jiang
Jay Karhade
Hewei Wang
...
Rushikesh Zawar
Xue Bai
Yilun Du
Chuang Gan
Deva Ramanan
VGen
101
3
0
21 Apr 2025
GenCLIP: Generalizing CLIP Prompts for Zero-shot Anomaly Detection
Donghyeong Kim
Chaewon Park
Suhwan Cho
Hyeonjeong Lim
Minseok Kang
Jungho Lee
Sangyoun Lee
VLM
133
0
0
21 Apr 2025
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
David Ma
Yanzhe Zhang
J. Ren
Jarvis Guo
Yifan Yao
...
Shiwen Ni
Jing Liu
Wenhao Huang
Ge Zhang
Xiaojie Jin
VLM
141
1
0
21 Apr 2025
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning
Yassir Benhammou
Alessandro Tiberio
Gabriel Trautmann
Suman Kalyan
MLLM, VLM
71
0
0
21 Apr 2025
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Weiye Xu
Jun Wang
Weiyun Wang
Zhe Chen
Wengang Zhou
...
Xiaohua Wang
Xizhou Zhu
Wenhai Wang
Jifeng Dai
Jinguo Zhu
VLM, LRM
186
7
0
21 Apr 2025
AGI-Driven Generative Semantic Communications: Principles and Practices
Xiaojun Yuan
Haoming Ma
Yinuo Huang
Zhoufan Hua
Yong Zuo
Z. Ding
AI4CE
82
0
0
21 Apr 2025
DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding
Geng Li
Jinglin Xu
Yunzhen Zhao
Yuxin Peng
ObjD
94
3
0
21 Apr 2025
LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation
Jiachen Li
Qing Xie
Xiaohan Yu
Hongyun Wang
Jinyu Xu
Yongjian Liu
ObjD
150
0
0
20 Apr 2025
ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-centric Semantic Fusion
Mingjie Zhang
Yuheng Du
Chengkai Wu
Jinni Zhou
Zhenchao Qi
Jun Ma
Boyu Zhou
218
0
0
20 Apr 2025
Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction
Li Yu
Xuanzhe Sun
Wei Zhou
Moncef Gabbouj
DiffM
71
0
0
19 Apr 2025
How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?
Rahul Thapa
Andrew Li
Qingyang Wu
Bryan He
Yuki Sahashi
...
Angela Zhang
Ben Athiwaratkun
Shuaiwen Leon Song
David Ouyang
James Zou
LM&MA
174
0
0
19 Apr 2025
Scaling LLaNA: Advancing NeRF-Language Understanding Through Large-Scale Training
Andrea Amaduzzi
Pierluigi Zama Ramirez
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
105
1
0
18 Apr 2025
Analysing the Robustness of Vision-Language-Models to Common Corruptions
Muhammad Usama
Syeda Aishah Asim
Syed Bilal Ali
Syed Talal Wasim
Umair Bin Mansoor
VLM
93
0
0
18 Apr 2025
Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training
Xinsong Zhang
Yarong Zeng
Xinting Huang
Hu Hu
Runquan Xie
Han Hu
Zhanhui Kang
MLLM, VLM
269
2
0
17 Apr 2025
Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval
WonJun Moon
Cheol-Ho Cho
Woojin Jun
Minho Shim
Taeoh Kim
Inwoong Lee
Dongyoon Wee
Jae-Pil Heo
99
0
0
17 Apr 2025
Science-T2I: Addressing Scientific Illusions in Image Synthesis
Jialuo Li
Wenhao Chai
Xingyu Fu
Haiyang Xu
Saining Xie
MedIm
80
1
0
17 Apr 2025
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery
Wei Zhang
Miaoxin Cai
Yaqian Ning
Tianze Zhang
Yin Zhuang
He Chen
Jun Li
Xuerui Mao
101
0
0
17 Apr 2025
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Xiangyan Liu
Jinjie Ni
Zijian Wu
Chao Du
Longxu Dou
Haoran Wang
Tianyu Pang
Michael Shieh
OffRL, LRM
487
16
0
17 Apr 2025
Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models
Zhanglin Wu
Tengfei Song
Ning Xie
Mengli Zhu
Weidong Zhang
...
Pengfei Li
Chong Li
Junhao Zhu
Hao Yang
Shiliang Sun
119
2
0
16 Apr 2025
Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
Yifei Dong
Fengyi Wu
Sanjian Zhang
Guangyu Chen
Yuzhi Hu
...
Jingdong Sun
Siyu Huang
Feng Liu
Qi Dai
Zhi-Qi Cheng
123
0
0
16 Apr 2025
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar
Ali Etemad
96
0
0
16 Apr 2025
FLIP Reasoning Challenge
Andreas Plesner
Turlan Kuzhagaliyev
Roger Wattenhofer
AAML, VLM, LRM
187
0
0
16 Apr 2025
Instruction-augmented Multimodal Alignment for Image-Text and Element Matching
Xinli Yue
Jianhui Sun
Junda Lu
Liangchao Yao
Fan Xia
Tianyi Wang
Fengyun Rao
Jing Lyu
Yuetang Deng
85
2
0
16 Apr 2025
Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection
Xiwen Li
Ross T. Whitaker
Tolga Tasdizen
58
0
0
15 Apr 2025
Video Summarization with Large Language Models
Min Jung Lee
Dayoung Gong
Minsu Cho
82
0
0
15 Apr 2025
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Efthymios Georgiou
Vassilis Katsouros
Yannis Avrithis
Alexandros Potamianos
98
1
0
15 Apr 2025
Benchmarking Vision Language Models on German Factual Data
René Peinl
Vincent Tischler
CoGe
172
1
0
15 Apr 2025
InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation
Yukang Lin
Y. Hong
Zunnan Xu
Xiaochen Li
Chao Xu
...
Jun Lan
Huijia Zhu
Weiqiang Wang
Jianfu Zhang
Xiu Li
VGen
99
0
0
15 Apr 2025
Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis
Kaiwen Zheng
Xuri Ge
Junchen Fu
Jun Peng
J. Jose
CVBM
68
0
0
14 Apr 2025
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao
Isaac Chung
Imene Kerboua
Jamie Stirling
Xin Zhang
Márton Kardos
Roman Solomatin
Noura Al Moubayed
Kenneth Enevoldsen
Niklas Muennighoff
VLM
148
2
0
14 Apr 2025
SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging
Tan-Hanh Pham
Chris Ngo
Trong-Duong Bui
Minh Luu Quang
Tan-Huong Pham
Truong-Son Hy
122
2
0
14 Apr 2025
ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models
Amirhosein Chahe
Lifeng Zhou
LRM
93
0
0
14 Apr 2025