ResearchTrend.AI
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,347 papers shown
Teaching VLMs to Localize Specific Objects from In-context Examples
Sivan Doveh
Nimrod Shabtay
Wei Lin
Eli Schwartz
Hilde Kuehne
...
Leonid Karlinsky
James Glass
Assaf Arbelle
S. Ullman
Muhammad Jehanzeb Mirza
VLM
195
1
0
20 Nov 2024
Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models
Zhen Zeng
Leijiang Gu
Xun Yang
Zhangling Duan
Zenglin Shi
Meng Wang
KELM
126
2
0
19 Nov 2024
Med-2E3: A 2D-Enhanced 3D Medical Multimodal Large Language Model
Yiming Shi
Xun Zhu
Ying Hu
Chenyi Guo
Miao Li
Ji Wu
136
2
0
19 Nov 2024
Generative Timelines for Instructed Visual Assembly
Alejandro Pardo
Jui-hsien Wang
Guohao Li
Josef Sivic
Bryan C. Russell
Fabian Caba Heilbron
VGen
106
0
0
19 Nov 2024
VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation
Bangguo Yu
Yuzhen Liu
Lei Han
Hamidreza Kasaei
Tingguang Li
M. Cao
LM&Ro
184
3
0
18 Nov 2024
SignEye: Traffic Sign Interpretation from Vehicle First-Person View
Chuang Yang
Xu Han
T. Han
Yuejiao Su
Junyu Gao
Hongyuan Zhang
Yi Wang
Lap-Pui Chau
123
2
0
18 Nov 2024
GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts
Junwen He
Yifan Wang
Lijun Wang
Huchuan Lu
Jun-Yan He
Chong Li
Hanyuan Chen
Jin-Peng Lan
Bin Luo
Yifeng Geng
119
1
0
18 Nov 2024
Video-to-Task Learning via Motion-Guided Attention for Few-Shot Action Recognition
Hanyu Guo
Wanchuan Yu
Suzhou Que
Kaiwen Du
Yan Yan
Hanzi Wang
189
1
0
18 Nov 2024
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Ruichuan An
Sihan Yang
Ming Lu
Kai Zeng
Yulin Luo
...
Hao Liang
Qi She
Shanghang Zhang
Wentao Zhang
200
11
0
18 Nov 2024
Efficient Transfer Learning for Video-language Foundation Models
Haoxing Chen
Zizheng Huang
Y. Hong
Yanshuo Wang
Zhongcai Lyu
Zhuoer Xu
Jun Lan
Zhangxuan Gu
VLM
105
0
0
18 Nov 2024
On-Board Vision-Language Models for Personalized Autonomous Vehicle Motion Control: System Design and Real-World Validation
Can Cui
Zichong Yang
Yupeng Zhou
Juntong Peng
Sung-Yeon Park
...
Yiheng Feng
Jitesh Panchal
Lingxi Li
Yaobin Chen
Ziran Wang
124
9
0
17 Nov 2024
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
Tingyu Qu
Mingxiao Li
Tinne Tuytelaars
Marie-Francine Moens
VLM
113
2
0
17 Nov 2024
Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry
Wenjun Hou
Yi Cheng
Kaishuai Xu
Yan Hu
Wenjie Li
Jiang-Dong Liu
68
1
0
17 Nov 2024
MpoxVLM: A Vision-Language Model for Diagnosing Skin Lesions from Mpox Virus Infection
Xu Cao
Wenqian Ye
K. Moise
Megan Coffee
90
2
0
16 Nov 2024
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
Jinqiang Long
Yanqi Dai
Guoxing Yang
Hongpeng Lin
Nanyi Fei
Yizhao Gao
Zhiwu Lu
MoE VLM
90
1
0
16 Nov 2024
Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning
Jingru Yang
Huan Yu
Yang Jingxin
C. Xu
Yin Biao
Yu Sun
Shengfeng He
58
1
0
15 Nov 2024
Explanation for Trajectory Planning using Multi-modal Large Language Model for Autonomous Driving
Shota Yamazaki
Chenyu Zhang
Takuya Nanri
Akio Shigekane
Siyuan Wang
Jo Nishiyama
Tao Chu
Kohei Yokosawa
LRM
104
1
0
15 Nov 2024
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Weiyun Wang
Zhe Chen
Wenhai Wang
Yue Cao
Yangzhou Liu
...
Jinguo Zhu
X. Zhu
Lewei Lu
Yu Qiao
Jifeng Dai
LRM
143
93
1
15 Nov 2024
Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting
Yian Wang
Xiaowen Qiu
Jiageng Liu
Zhehuan Chen
Jiting Cai
Yufei Wang
Tsun-Hsuan Wang
Zhou Xian
Chuang Gan
VGen AI4CE
108
7
0
14 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Zechao Li
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
74
4
0
14 Nov 2024
LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution
Chenyang Wang
Wenjie An
Kui Jiang
Xianming Liu
Junjun Jiang
CVBM
53
2
0
14 Nov 2024
Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
Yuheng Shi
Minjing Dong
Chang Xu
VLM
118
3
0
14 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
189
2
0
14 Nov 2024
Multimodal Instruction Tuning with Hybrid State Space Models
Jianing Zhou
Han Li
Shuai Zhang
Ning Xie
Ruijie Wang
Xiaohan Nie
Sheng Liu
Lingyun Wang
79
0
0
13 Nov 2024
NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation
Youzhi Liu
Fanglong Yao
Yuanchang Yue
Guangluan Xu
Xian Sun
Kun Fu
LM&Ro
102
3
0
13 Nov 2024
Public Health Advocacy Dataset: A Dataset of Tobacco Usage Videos from Social Media
N. V. R. Chappa
Charlotte McCormick
Susana Rodriguez Gongora
P. Dobbs
Khoa Luu
137
2
0
12 Nov 2024
SparrowVQE: Visual Question Explanation for Course Content Understanding
Jialu Li
Manish Kumar Thota
Ruslan Gokhman
Radek Holik
Youshan Zhang
105
1
0
12 Nov 2024
Prompt-enhanced Network for Hateful Meme Classification
Junxi Liu
Yanyan Feng
Jiehai Chen
Yun Xue
Fenghuan Li
VLM
111
0
0
12 Nov 2024
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
Yichen He
Yuan Lin
Jianchao Wu
Hanchong Zhang
Yuchen Zhang
Ruicheng Le
VGen VLM
321
2
0
11 Nov 2024
ViTOC: Vision Transformer and Object-aware Captioner
Feiyang Huang
102
0
0
09 Nov 2024
Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models
Xiao Liu
Lijun Zhang
Deepak Ganesan
Hui Guan
VLM
102
0
0
08 Nov 2024
Improving image synthesis with diffusion-negative sampling
Alakh Desai
Nuno Vasconcelos
DiffM
42
2
0
08 Nov 2024
Autoregressive Models in Vision: A Survey
Jing Xiong
Gongye Liu
Lun Huang
Chengyue Wu
Taiqiang Wu
...
Hao Fei
Guillermo Sapiro
Jiebo Luo
Ping Luo
Ngai Wong
VGen
191
14
0
08 Nov 2024
Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
Shuhong Zheng
Zhipeng Bao
Ruoyu Zhao
Martial Hebert
Yu-Xiong Wang
DiffM
162
0
0
07 Nov 2024
HourVideo: 1-Hour Video-Language Understanding
Keshigeyan Chandrasegaran
Agrim Gupta
Lea M. Hadzic
Taran Kota
Jimming He
Cristobal Eyzaguirre
Zane Durante
Manling Li
Jiajun Wu
L. Fei-Fei
VLM
108
49
0
07 Nov 2024
AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation
Anil Kag
Huseyin Coskun
Jierun Chen
Junli Cao
Willi Menapace
Aliaksandr Siarohin
Sergey Tulyakov
Jian Ren
93
3
0
07 Nov 2024
TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models
Jonathan Fhima
Elad Ben Avraham
Oren Nuriel
Yair Kittenplon
Roy Ganz
Aviad Aberdam
Ron Litman
VLM
67
1
0
07 Nov 2024
Exploring Hierarchical Molecular Graph Representation in Multimodal LLMs
Chengxin Hu
Hao Li
Yihe Yuan
Jing Li
Ivor Tsang
127
1
0
07 Nov 2024
CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM
Jingwei Xu
Chenyu Wang
Zibo Zhao
Wen Liu
Yi-An Ma
Shenghua Gao
141
18
0
07 Nov 2024
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Shehan Munasinghe
Hanan Gani
Wenqi Zhu
Jiale Cao
Eric P. Xing
Fahad Shahbaz Khan
Salman Khan
MLLM VGen VLM
127
9
0
07 Nov 2024
DesignMinds: Enhancing Video-Based Design Ideation with Vision-Language Model and Context-Injected Large Language Model
Tianhao He
Andrija Stankovic
E. Niforatos
Gerd Kortuem
MLLM VGen VLM
78
0
0
06 Nov 2024
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
D. Song
Sicheng Lai
Shunian Chen
Lichao Sun
Benyou Wang
462
1
0
06 Nov 2024
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
Ziliang Gan
Yu Lu
D. Zhang
Haohan Li
Che Liu
...
Haipang Wu
Chaoyou Fu
Z. Xu
Rongjunchen Zhang
Yong Dai
106
13
0
05 Nov 2024
Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey
Ao Fu
Yi Zhou
Tao Zhou
Yue Yang
Bojun Gao
Qun Li
Guobin Wu
Ling Shao
VGen
100
3
0
05 Nov 2024
Membership Inference Attacks against Large Vision-Language Models
Zhan Li
Yongtao Wu
Yihang Chen
F. Tonin
Elias Abad Rocamora
Volkan Cevher
80
9
0
05 Nov 2024
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Yingzi Ma
Jiongxiao Wang
Fei Wang
Siyuan Ma
Jiazhao Li
...
B. Li
Yejin Choi
Mengzhao Chen
Chaowei Xiao
MU
131
10
0
05 Nov 2024
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Maitreya Patel
Abhiram Kusumba
Sheng Cheng
Changhoon Kim
Tejas Gokhale
Chitta Baral
Yezhou Yang
CoGe CLIP
143
14
0
04 Nov 2024
INQUIRE: A Natural World Text-to-Image Retrieval Benchmark
Edward Vendrow
Omiros Pantazis
Alexander Shepard
Gabriel J. Brostow
Kate E. Jones
Oisin Mac Aodha
Sara Beery
Grant Van Horn
VLM
111
7
0
04 Nov 2024
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Ruyang Liu
Haoran Tang
Haibo Liu
Yixiao Ge
Ying Shan
Chen Li
Jiankun Yang
VLM
72
7
0
04 Nov 2024
TableGPT2: A Large Multimodal Model with Tabular Data Integration
Aofeng Su
Aowen Wang
Chao Ye
Chen Zhou
G. Zhang
...
Xijun Gu
Xingwu Sun
Xianrui Li
Yue Yang
Zhiqing Xiao
PINN VLM LMTD
138
23
0
04 Nov 2024