ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.12597
  4. Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
v1v2v3 (latest)

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
    VLMMLLM
ArXiv (abs)PDFHTML

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Title
VideoLLM-online: Online Video Large Language Model for Streaming Video
VideoLLM-online: Online Video Large Language Model for Streaming Video
Joya Chen
Zhaoyang Lv
Shiwei Wu
Kevin Qinghong Lin
Chenan Song
Difei Gao
Jia-Wei Liu
Ziteng Gao
Dongxing Mao
Mike Zheng Shou
MLLMMoMe
137
59
0
17 Jun 2024
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu
Yuvan Sharma
Giscard Biamby
Jerome Quenum
Yutong Bai
Baifeng Shi
Trevor Darrell
Roei Herzig
LM&RoVLM
110
27
0
17 Jun 2024
AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic
  Manipulation
AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation
Chuyan Xiong
Chengyu Shen
Xiaoqi Li
Kaichen Zhou
Jiaming Liu
Ruiping Wang
Hao Dong
LRM
110
13
0
17 Jun 2024
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Yunxin Li
Xinyu Chen
Baotian Hu
Longyue Wang
Haoyuan Shi
Min Zhang
MLLMLRM
167
38
0
17 Jun 2024
Generative Visual Instruction Tuning
Generative Visual Instruction Tuning
Jefferson Hernandez
Ruben Villegas
Vicente Ordonez
VLM
73
4
0
17 Jun 2024
SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and
  Lexical Alterations
SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala
Aman Jaiswal
Chandramouli Shama Sastry
E. Milios
Sageev Oore
Hassan Sajjad
CoGe
94
12
0
17 Jun 2024
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with
  Instruction Tuning
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
Zebang Cheng
Zhi-Qi Cheng
Jun-Yan He
Jingdong Sun
Kai Wang
Yuxiang Lin
Zheng Lian
Xiaojiang Peng
Alexander G. Hauptmann
MLLM
124
40
0
17 Jun 2024
Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
Han-Hung Lee
Yiming Zhang
Angel X. Chang
3DPC
164
4
0
17 Jun 2024
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Robert Honig
Javier Rando
Nicholas Carlini
Florian Tramèr
WIGMAAML
137
21
0
17 Jun 2024
GUICourse: From General Vision Language Models to Versatile GUI Agents
GUICourse: From General Vision Language Models to Versatile GUI Agents
Wentong Chen
Junbo Cui
Jinyi Hu
Yujia Qin
Junjie Fang
...
Yupeng Huo
Yuan Yao
Yankai Lin
Zhiyuan Liu
Maosong Sun
LLMAG
162
41
0
17 Jun 2024
WildVision: Evaluating Vision-Language Models in the Wild with Human
  Preferences
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Yujie Lu
Dongfu Jiang
Wenhu Chen
William Yang Wang
Yejin Choi
Bill Yuchen Lin
VLM
112
33
0
16 Jun 2024
Leveraging Foundation Models for Multi-modal Federated Learning with
  Incomplete Modality
Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality
Liwei Che
Jiaqi Wang
Xinyue Liu
Fenglong Ma
67
4
0
16 Jun 2024
Investigating Video Reasoning Capability of Large Language Models with
  Tropes in Movies
Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies
Hung-Ting Su
Chun-Tong Chao
Ya-Ching Hsu
Xudong Lin
Yulei Niu
Hung-Yi Lee
Winston H. Hsu
LRM
76
1
0
16 Jun 2024
Improving Large Models with Small models: Lower Costs and Better
  Performance
Improving Large Models with Small models: Lower Costs and Better Performance
Dong Chen
Shuo Zhang
Yueting Zhuang
Siliang Tang
Qidong Liu
Hua Wang
Mingliang Xu
96
6
0
15 Jun 2024
BEACON: Benchmark for Comprehensive RNA Tasks and Language Models
BEACON: Benchmark for Comprehensive RNA Tasks and Language Models
Yuchen Ren
Zhiyuan Chen
Lifeng Qiao
Hongtai Jing
Yuchen Cai
...
Siqi Sun
Hongliang Yan
Dong Yuan
Wanli Ouyang
Xihui Liu
83
11
0
14 Jun 2024
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language
  Large Models
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Chenyu Zhou
Mengdan Zhang
Peixian Chen
Chaoyou Fu
Yunhang Shen
Xiawu Zheng
Xing Sun
Rongrong Ji
VLM
79
4
0
14 Jun 2024
Detecting and Evaluating Medical Hallucinations in Large Vision Language
  Models
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Jiawei Chen
Dingkang Yang
Tong Wu
Yue Jiang
Xiaolu Hou
Mingcheng Li
Shunli Wang
Dongling Xiao
Ke Li
Lihua Zhang
LM&MAVLM
85
22
0
14 Jun 2024
CarLLaVA: Vision language models for camera-only closed-loop driving
CarLLaVA: Vision language models for camera-only closed-loop driving
Katrin Renz
Long Chen
Ana-Maria Marcu
Jan Hünermann
Benoît Hanotte
Alice Karnsund
Jamie Shotton
Elahe Arani
Oleg Sinavski
VLM
138
26
0
14 Jun 2024
Localizing Events in Videos with Multimodal Queries
Localizing Events in Videos with Multimodal Queries
Gengyuan Zhang
Mang Ling Ada Fok
Yan Xia
Yansong Tang
Zorah Lähner
Philip Torr
Volker Tresp
Jindong Gu
98
2
0
14 Jun 2024
UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot
  Audio Task Learner
UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner
Dongchao Yang
Haohan Guo
Yuanyuan Wang
Rongjie Huang
Xiang Li
Xu Tan
Xixin Wu
Helen Meng
AuLLM
93
17
0
14 Jun 2024
What Does Softmax Probability Tell Us about Classifiers Ranking Across
  Diverse Test Conditions?
What Does Softmax Probability Tell Us about Classifiers Ranking Across Diverse Test Conditions?
Weijie Tu
Weijian Deng
Liang Zheng
Tom Gedeon
96
1
0
14 Jun 2024
GPT-4o: Visual perception performance of multimodal large language
  models in piglet activity understanding
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding
Yiqi Wu
Xiaodan Hu
Ziming Fu
Siling Zhou
Jiangong Li
MLLM
73
12
0
14 Jun 2024
Contrastive Imitation Learning for Language-guided Multi-Task Robotic
  Manipulation
Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation
Teli Ma
Jiaming Zhou
Zifan Wang
Ronghe Qiu
Junwei Liang
90
11
0
14 Jun 2024
Composing Parts for Expressive Object Generation
Composing Parts for Expressive Object Generation
Harsh Rangwani
Aishwarya Agarwal
Kuldeep Kulkarni
R. Venkatesh Babu
Srikrishna Karanam
DiffM
110
2
0
14 Jun 2024
ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
ImageNet3D: Towards General-Purpose Object-Level 3D Understanding
Wufei Ma
Guanning Zeng
Guofeng Zhang
Qihao Liu
Letian Zhang
Adam Kortylewski
Yaoyao Liu
Alan Yuille
VLM3DV
91
10
0
13 Jun 2024
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video
  Understanding
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad A Khan
VLMMLLM
97
61
0
13 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLMLRM
87
1
0
13 Jun 2024
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks
  and Algorithms
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Miaosen Zhang
Yixuan Wei
Zhen Xing
Yifei Ma
Zuxuan Wu
...
Zheng Zhang
Qi Dai
Chong Luo
Xin Geng
Baining Guo
VLM
89
1
0
13 Jun 2024
SimGen: Simulator-conditioned Driving Scene Generation
SimGen: Simulator-conditioned Driving Scene Generation
Yunsong Zhou
Michael Simon
Zhenghao Peng
Sicheng Mo
Hongzi Zhu
Minyi Guo
Bolei Zhou
VGen
93
17
0
13 Jun 2024
Towards Vision-Language Geo-Foundation Model: A Survey
Towards Vision-Language Geo-Foundation Model: A Survey
Yue Zhou
Xue Jiang
Yiping Ke
Yiping Ke
Junchi Yan
Xue Yang
Wayne Zhang
89
19
0
13 Jun 2024
CLIPAway: Harmonizing Focused Embeddings for Removing Objects via
  Diffusion Models
CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models
Yigit Ekin
Ahmet Burak Yildirim
Erdem Çağlar
Aykut Erdem
Erkut Erdem
Aysegül Dündar
DiffM
90
10
0
13 Jun 2024
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
Samar Fares
Klea Ziu
Toluwani Aremu
Nikita Durasov
Martin Takáč
Pascal Fua
Karthik Nandakumar
Ivan Laptev
VLMAAML
101
5
0
13 Jun 2024
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim
Karl Pertsch
Siddharth Karamcheti
Ted Xiao
Ashwin Balakrishna
...
Russ Tedrake
Dorsa Sadigh
Sergey Levine
Percy Liang
Chelsea Finn
LM&RoVLM
296
535
0
13 Jun 2024
Comparison Visual Instruction Tuning
Comparison Visual Instruction Tuning
Wei Lin
M. Jehanzeb Mirza
Sivan Doveh
Rogerio Feris
Raja Giryes
Sepp Hochreiter
Leonid Karlinsky
91
4
0
13 Jun 2024
ReMI: A Dataset for Reasoning with Multiple Images
ReMI: A Dataset for Reasoning with Multiple Images
Mehran Kazemi
Nishanth Dikkala
Ankit Anand
Petar Dević
Ishita Dasgupta
...
Bahare Fatemi
Pranjal Awasthi
Dee Guo
Sreenivas Gollapudi
Ahmed Qureshi
LRMVLM
110
17
0
13 Jun 2024
Generative AI-based Prompt Evolution Engineering Design Optimization
  With Vision-Language Model
Generative AI-based Prompt Evolution Engineering Design Optimization With Vision-Language Model
Melvin Wong
Thiago Rios
Stefan Menzel
Yew-Soon Ong
78
3
0
13 Jun 2024
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance
  in Insurance
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Chenwei Lin
Hanjia Lyu
Xian Xu
Jiebo Luo
72
2
0
13 Jun 2024
Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality
  Generation
Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation
Lincan Cai
Shuang Li
Wenxuan Ma
Jingxuan Kang
Binhui Xie
Zixun Sun
Chengwei Zhu
MoEMoMe
91
1
0
13 Jun 2024
Conceptual Learning via Embedding Approximations for Reinforcing
  Interpretability and Transparency
Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency
Maor Dikter
Tsachi Blau
Chaim Baskin
130
0
0
13 Jun 2024
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Jongwoo Park
Kanchana Ranasinghe
Kumara Kahatapitiya
Wonjeong Ryoo
Donghyun Kim
Michael S. Ryoo
162
25
0
13 Jun 2024
MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
Xuannan Liu
Zekun Li
Peipei Li
Shuhan Xia
Xing Cui
Linzhi Huang
Huaibo Huang
Weihong Deng
Zhaofeng He
143
23
0
13 Jun 2024
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral
A. Zebaze
Pedro Ortiz Suarez
Julien Abadji
Rémi Lacroix
Cordelia Schmid
Rachel Bawden
Benoît Sagot
171
3
0
13 Jun 2024
Vivid-ZOO: Multi-View Video Generation with Diffusion Model
Vivid-ZOO: Multi-View Video Generation with Diffusion Model
Bing Li
Cheng Zheng
Wenxuan Zhu
Jinjie Mai
Biao Zhang
Peter Wonka
Bernard Ghanem
110
17
0
12 Jun 2024
FakeInversion: Learning to Detect Images from Unseen Text-to-Image
  Models by Inverting Stable Diffusion
FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion
George Cazenavette
Avneesh Sud
Thomas Leung
Ben Usman
DiffM
80
22
0
12 Jun 2024
Pandora: Towards General World Model with Natural Language Actions and
  Video States
Pandora: Towards General World Model with Natural Language Actions and Video States
Jiannan Xiang
Guangyi Liu
Yi Gu
Qiyue Gao
Yuting Ning
...
Shibo Hao
Yemin Shi
Zhengzhong Liu
Eric P. Xing
Zhiting Hu
VGen
151
43
0
12 Jun 2024
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Yi-Fan Zhang
Qingsong Wen
Chaoyou Fu
Xue Wang
Zhang Zhang
Liwen Wang
Rong Jin
135
46
0
12 Jun 2024
What If We Recaption Billions of Web Images with LLaMA-3?
What If We Recaption Billions of Web Images with LLaMA-3?
Xianhang Li
Haoqin Tu
Mude Hui
Zeyu Wang
Bingchen Zhao
...
Jieru Mei
Qing Liu
Huangjie Zheng
Yuyin Zhou
Cihang Xie
VLMMLLM
97
42
0
12 Jun 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
  Interleaved with Text
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li
Zhe Chen
Weiyun Wang
Wenhai Wang
Shenglong Ye
...
Dahua Lin
Yu Qiao
Botian Shi
Conghui He
Jifeng Dai
VLMOffRL
119
27
0
12 Jun 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
  in Videos
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He
Weixi Feng
Kaizhi Zheng
Yujie Lu
Wanrong Zhu
...
Zhengyuan Yang
Kevin Lin
William Yang Wang
Lijuan Wang
Xin Eric Wang
VGenLRM
137
22
0
12 Jun 2024
DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor
DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor
Juncheng Wu
Zhangkai Ni
Hanli Wang
Wenhan Yang
Yuyin Zhou
Shiqi Wang
78
1
0
12 Jun 2024
Previous
123...323334...464748
Next