ResearchTrend.AI
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Title
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
Min Zhang
Tat-Seng Chua
Shuicheng Yan
AI4TS
128
43
0
27 Jun 2024
DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model
Shangwen Wang
Huijun Liu
Jie Yu
Shan Zhao
Xiaopeng Li
Jun Ma
Xiaodong Liu
Zhuo Li
Xiaoguang Mao
60
1
0
27 Jun 2024
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Jiaxin Zhang
Wentao Yang
Songxuan Lai
Zecheng Xie
Lianwen Jin
100
21
0
27 Jun 2024
AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
Yanan Sun
Yanchen Liu
Yinhao Tang
Wenjie Pei
Kai Chen
DiffM
114
11
0
27 Jun 2024
Factor-Conditioned Speaking-Style Captioning
Atsushi Ando
Takafumi Moriya
Shota Horiguchi
Ryo Masumura
73
0
0
27 Jun 2024
Curriculum Learning with Quality-Driven Data Selection
Biao Wu
Fang Meng
117
2
0
27 Jun 2024
Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
H. Kerdegari
Kyle Higgins
Dennis Veselkov
I. Laponogov
I. Poļaka
...
Junior Andrea Pescino
M. Leja
M. Dinis-Ribeiro
T. F. Kanonnikoff
Kirill Veselkov
112
5
0
26 Jun 2024
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension
Jiafeng Liang
Shixin Jiang
Zekun Wang
Haojie Pan
Zerui Chen
Zheng Chu
Ming Liu
Ruiji Fu
Zhongyuan Wang
Bing Qin
69
3
0
26 Jun 2024
MammothModa: Multi-Modal Large Language Model
Qi She
Junwen Pan
Xin Wan
Rui Zhang
Dawei Lu
Kai Huang
MLLM, VLM
58
1
0
26 Jun 2024
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval
Weitong Cai
Jiabo Huang
Shaogang Gong
Hailin Jin
Yang Liu
83
0
0
25 Jun 2024
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Xiangyu Zhao
Xiangtai Li
Haodong Duan
Haian Huang
Yining Li
Kai Chen
Hua Yang
VLM, MLLM
120
12
0
25 Jun 2024
DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation
Ahmad Mohammadshirazi
Ali Nosrati Firoozsalari
Mengxi Zhou
Dheeraj Kulshrestha
R. Ramnath
93
0
0
25 Jun 2024
A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR
Van Tung Pham
Yist Y. Lin
Tao Han
Wei Li
Jun Zhang
Lu Lu
Yuxuan Wang
AuLLM
70
1
0
25 Jun 2024
Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models
Bei Yan
Jie Zhang
Zheng Yuan
Shiguang Shan
Xilin Chen
VLM
68
5
0
24 Jun 2024
Long Context Transfer from Language to Vision
Peiyuan Zhang
Kaichen Zhang
Bo Li
Guangtao Zeng
Jingkang Yang
Yuanhan Zhang
Ziyue Wang
Haoran Tan
Chunyuan Li
Ziwei Liu
VLM
145
189
0
24 Jun 2024
QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds
Ye Wang
Yuting Mei
Sipeng Zheng
Qin Jin
LRM
133
4
0
24 Jun 2024
DaLPSR: Leverage Degradation-Aligned Language Prompt for Real-World Image Super-Resolution
Aiwen Jiang
Zhi Wei
Long Peng
Feiqiang Liu
Wenbo Li
Mingwen Wang
DiffM
82
2
0
24 Jun 2024
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Yuxuan Wang
Yueqian Wang
Dongyan Zhao
Cihang Xie
Zilong Zheng
MLLM, VLM
100
31
0
24 Jun 2024
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Yuang Peng
Yuxin Cui
Haomiao Tang
Zekun Qi
Runpei Dong
Jing Bai
Chunrui Han
Zheng Ge
Xiangyu Zhang
Shu-Tao Xia
EGVM
197
39
0
24 Jun 2024
Reading Is Believing: Revisiting Language Bottleneck Models for Image Classification
Honori Udo
Takafumi Koshinaka
VLM
71
0
0
22 Jun 2024
MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception
Guanqun Wang
Xinyu Wei
Jiaming Liu
Ray Zhang
Yichi Zhang
Kevin Zhang
Maurice Chong
Shanghang Zhang
VLM, LRM
62
0
0
22 Jun 2024
Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads
A. Cherian
Kuan-Chuan Peng
Suhas Lohit
Joanna Matthiesen
Kevin A. Smith
J. Tenenbaum
ELM, LRM
77
8
0
22 Jun 2024
MetaGreen: Meta-Learning Inspired Transformer Selection for Green Semantic Communication
Shubhabrata Mukherjee
Cory Beard
Sejun Song
70
0
0
22 Jun 2024
Image Conductor: Precision Control for Interactive Video Synthesis
Yaowei Li
Xintao Wang
Zhaoyang Zhang
Zhouxia Wang
Ziyang Yuan
Liangbin Xie
Yuexian Zou
Ying Shan
VGen
120
27
0
21 Jun 2024
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Brandon Huang
Chancharik Mitra
Assaf Arbelle
Leonid Karlinsky
Trevor Darrell
Roei Herzig
101
21
0
21 Jun 2024
Towards Retrieval Augmented Generation over Large Video Libraries
Yannis Tevissen
Khalil Guetari
Frédéric Petitpont
RALM
77
2
0
21 Jun 2024
LLM2TEA: Agentic AI Designer Finds Innovative Objects with Generative Evolutionary Multitasking
Melvin Wong
Jiao Liu
Thiago Rios
Stefan Menzel
Yew-Soon Ong
117
2
0
21 Jun 2024
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Yuxuan Qiao
Haodong Duan
Xinyu Fang
Junming Yang
Lin Chen
Songyang Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LRM
107
23
0
20 Jun 2024
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Xinyu Fang
Kangrui Mao
Haodong Duan
Xiangyu Zhao
Yining Li
Dahua Lin
Kai Chen
VLM
112
83
0
20 Jun 2024
IWISDM: Assessing instruction following in multimodal models at scale
Xiaoxuan Lei
Lucas Gomez
Hao Yuan Bai
P. Bashivan
VLM
122
2
0
20 Jun 2024
REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability
Shuang Ao
Simon Khan
Haris Aziz
Flora D. Salim
138
0
0
20 Jun 2024
VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
Jie Zhang
Sibo Wang
Xiangkui Cao
Zheng Yuan
Shiguang Shan
Xilin Chen
Wen Gao
VLM
90
10
0
20 Jun 2024
Multi-modal Transfer Learning between Biological Foundation Models
Juan Jose Garau-Luis
Patrick Bordes
Liam Gonzalez
Masa Roller
Bernardo P. de Almeida
...
Stefan Laurent
Jan Grzegorzewski
Maren Lang
Thomas Pierrot
Guillaume Richard
AI4CE
99
6
0
20 Jun 2024
HIGHT: Hierarchical Graph Tokenization for Molecule-Language Alignment
Yongqiang Chen
Quanming Yao
Juzheng Zhang
James Cheng
Yatao Bian
131
3
0
20 Jun 2024
Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation
Eyal Michaeli
Ohad Fried
119
1
0
20 Jun 2024
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
Baiqi Li
Zhiqiu Lin
Deepak Pathak
Jiayao Li
Yixin Fei
...
Tiffany Ling
Xide Xia
Pengchuan Zhang
Graham Neubig
Deva Ramanan
EGVM
141
39
0
19 Jun 2024
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
Rushikesh Zawar
Shaurya Dewan
Andrew F. Luo
Margaret M. Henderson
Michael J. Tarr
Leila Wehbe
VGen, CoGe
78
1
0
19 Jun 2024
GUI Action Narrator: Where and When Did That Action Take Place?
Qinchen Wu
Difei Gao
Kevin Qinghong Lin
Zhuoyu Wu
Xiangwu Guo
Peiran Li
Weichen Zhang
Hengxu Wang
Mike Zheng Shou
105
3
0
19 Jun 2024
SpatialBot: Precise Spatial Understanding with Vision Language Models
Wenxiao Cai
Yaroslav Ponomarenko
Jianhao Yuan
Xiaoqi Li
Wankou Yang
Hao Dong
Bo Zhao
VLM
126
46
0
19 Jun 2024
Improving Visual Commonsense in Language Models via Multiple Image Generation
Guy Yariv
Idan Schwartz
Yossi Adi
Sagie Benaim
VLM, LRM
48
0
0
19 Jun 2024
Is AI fun? HumorDB: a curated dataset and benchmark to investigate graphical humor
Veedant Jain
Felipe dos Santos Alves Feitosa
Gabriel Kreiman
VLM
99
2
0
19 Jun 2024
Transferable speech-to-text large language model alignment module
Boyong Wu
Chao Yan
Haoran Pu
45
0
0
19 Jun 2024
Reinforcing Pre-trained Models Using Counterfactual Images
Xiang Li
Ren Togo
Keisuke Maeda
Takahiro Ogawa
Miki Haseyama
75
1
0
19 Jun 2024
GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs
Navid Rajabi
Jana Kosecka
71
14
0
19 Jun 2024
RITA: A Real-time Interactive Talking Avatars Framework
Wuxinlin Cheng
Cheng Wan
Yupeng Cao
Sihan Chen
82
0
0
18 Jun 2024
SeTAR: Out-of-Distribution Detection with Selective Low-Rank Approximation
Yixia Li
Boya Xiong
Guanhua Chen
Yun Chen
OODD
99
4
0
18 Jun 2024
Automatic benchmarking of large multimodal models via iterative experiment programming
Alessandro Conti
Enrico Fini
Paolo Rota
Yiming Wang
Massimiliano Mancini
Elisa Ricci
112
1
0
18 Jun 2024
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Xubing Ye
Yukang Gan
Xiaoke Huang
Yixiao Ge
Yansong Tang
MLLM, VLM
133
28
0
18 Jun 2024
LLaNA: Large Language and NeRF Assistant
Andrea Amaduzzi
Pierluigi Zama Ramirez
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
106
4
0
17 Jun 2024
Unveiling Encoder-Free Vision-Language Models
Haiwen Diao
Yufeng Cui
Xiaotong Li
Yueze Wang
Huchuan Lu
Xinlong Wang
VLM
122
36
0
17 Jun 2024