BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM

Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"

50 / 2,352 papers shown
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
226
72
0
19 Sep 2024
MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion
Kalakonda Sai Shashank
Shubh Maheshwari
Ravi Kiran Sarvadevabhatla
VGen, DiffM
107
3
0
18 Sep 2024
The Impact of Element Ordering on LM Agent Performance
Wayne Chi
Ameet Talwalkar
Chris Donahue
66
2
0
18 Sep 2024
ABHINAW: A method for Automatic Evaluation of Typography within AI-Generated Images
Abhinaw Jagtap
Nachiket Tapas
R. G. Brajesh
EGVM
74
0
0
18 Sep 2024
Navigation with VLM framework: Go to Any Language
Zecheng Yin
Chonghao Cheng
Lizhen
LM&Ro
49
0
0
18 Sep 2024
One Map to Find Them All: Real-time Open-Vocabulary Mapping for Zero-shot Multi-Object Navigation
F. L. Busch
Timon Homberger
Jesús Ortega-Peimbert
Quantao Yang
Olov Andersson
89
1
0
18 Sep 2024
Large Language Models are Strong Audio-Visual Speech Recognition Learners
Umberto Cappellazzo
Minsu Kim
Honglie Chen
Pingchuan Ma
Stavros Petridis
Daniele Falavigna
Alessio Brutti
Maja Pantic
114
12
0
18 Sep 2024
NVLM: Open Frontier-Class Multimodal LLMs
Wenliang Dai
Nayeon Lee
Wei Ping
Zhuoling Yang
Zihan Liu
Jon Barker
Tuomas Rintamaki
Mohammad Shoeybi
Bryan Catanzaro
Ming-Yu Liu
MLLM, VLM, LRM
123
73
0
17 Sep 2024
CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration
Jiahui Gao
Renjie Pi
Tianyang Han
Han Wu
Lanqing Hong
Lingpeng Kong
Xin Jiang
Zhenguo Li
125
8
0
17 Sep 2024
Surveying the MLLM Landscape: A Meta-Review of Current Surveys
Ming Li
Keyu Chen
Ziqian Bi
Ming Liu
Benji Peng
...
Jinlang Wang
Sen Zhang
X. Pan
Jiawei Xu
Pohsun Feng
OffRL
118
2
0
17 Sep 2024
Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text
Hongfei Xue
Wei Ren
Xuelong Geng
Kun Wei
Longhao Li
Qijie Shao
Linju Yang
Kai Diao
Lei Xie
AuLLM
80
4
0
17 Sep 2024
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Bilal Faye
Hanane Azzag
M. Lebbah
ObjD
105
0
0
17 Sep 2024
AMEGO: Active Memory from long EGOcentric videos
Gabriele Goletto
Tushar Nagarajan
Giuseppe Averta
Dima Damen
EgoV
89
7
0
17 Sep 2024
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models
Potsawee Manakul
Guangzhi Sun
Warit Sirichotedumrong
Kasima Tharnpipitchai
Kunat Pipatanakul
AuLLM
124
7
0
17 Sep 2024
Benchmarking VLMs' Reasoning About Persuasive Atypical Images
Sina Malakouti
Aysan Aghazadeh
Ashmit Khandelwal
Adriana Kovashka
VLM
108
2
0
16 Sep 2024
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models
Bingchen Liu
Ehsan Akhgari
Alexander Visheratin
Aleks Kamko
Linmiao Xu
Shivam Shrirao
Joao Souza
Suhail Doshi
Daiqing Li
DiffM, MLLM
109
60
0
16 Sep 2024
Do Pre-trained Vision-Language Models Encode Object States?
Kaleb Newman
Shijie Wang
Yuan Zang
David Heffren
Chen Sun
CoGe
71
1
0
16 Sep 2024
SoccerNet 2024 Challenges Results
A. Cioppa
Silvio Giancola
Vladimir Somers
Victor Joos
Floriane Magera
...
Yuan Li
Yuting Yang
Yuxuan Xiao
Zehua Cheng
Zhihao Li
99
2
0
16 Sep 2024
Latent Diffusion Models for Controllable RNA Sequence Generation
Kaixuan Huang
Yukang Yang
Kaidi Fu
Yanyi Chu
Le Cong
Mengdi Wang
89
2
0
15 Sep 2024
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training
Yiyi Tao
Zhuoyue Wang
Hang Zhang
Lun Wang
VLM
103
17
0
15 Sep 2024
Generative Semantic Communication via Textual Prompts: Latency Performance Tradeoffs
Mengmeng Ren
Li Qiao
Long Yang
Zhen Gao
Jian Chen
Mahdi Boloursaz Mashhadi
Pei Xiao
Rahim Tafazolli
Mehdi Bennis
VLM
149
5
0
15 Sep 2024
One missing piece in Vision and Language: A Survey on Comics Understanding
Emanuele Vivoli
Andrey Barsky
Mohamed Ali Souibgui
Artemis LLabres
Marco Bertini
Dimosthenis Karatzas
126
5
0
14 Sep 2024
Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM
Yuanjie Lyu
Tong Xu
Zihan Niu
Bo Peng
Jing Ke
Enhong Chen
66
0
0
14 Sep 2024
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha
Vinija Jain
Aman Chadha
74
3
0
14 Sep 2024
Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models
Dewen Zhang
Wangpeng An
Hayaru Shouno
3DH
38
0
0
14 Sep 2024
PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage
Denis Zavadski
Damjan Kalšan
Carsten Rother
DiffM, MDE
73
7
0
13 Sep 2024
Towards Unified Facial Action Unit Recognition Framework by Large Language Models
Guohong Hu
Xing Lan
Hanyu Jiang
Jiayi Lyu
Jian Xue
CVBM
76
1
0
13 Sep 2024
DeCLIP: Decoding CLIP representations for deepfake localization
Stefan Smeu
Elisabeta Oneata
Dan Oneaţă
101
4
0
12 Sep 2024
SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
Chenyang Lei
Liyi Chen
Jun Cen
Xiao Chen
Zhen Lei
Felix Heide
Ziwei Liu
Qifeng Chen
Zhaoxiang Zhang
97
0
0
12 Sep 2024
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Jianke Zhang
Yanjiang Guo
Xiaoyu Chen
Yen-Jen Wang
Yucheng Hu
Chengming Shi
Jianyu Chen
94
13
0
12 Sep 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Zehan Zhu
Lin Gui
Ruichu Cai
Erik Cambria
Hasti Seifi
105
6
0
11 Sep 2024
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Yang Liu
Pengxiang Ding
Siteng Huang
Min Zhang
Han Zhao
Donglin Wang
84
7
0
11 Sep 2024
Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations
Keumgang Cha
Donggeun Yu
Junghoon Seo
VLM
78
1
0
11 Sep 2024
What to align in multimodal contrastive learning?
Benoit Dufumier
J. Castillo-Navarro
D. Tuia
Jean-Philippe Thiran
158
4
0
11 Sep 2024
INTRA: Interaction Relationship-aware Weakly Supervised Affordance Grounding
Ji Ha Jang
H. Seo
Se Young Chun
93
3
0
10 Sep 2024
EDADepth: Enhanced Data Augmentation for Monocular Depth Estimation
Nischal Khanal
Shivanand Venkanna Sheshappanavar
MDE
100
0
0
10 Sep 2024
Revisiting Prompt Pretraining of Vision-Language Models
Zhenyuan Chen
Lingfeng Yang
Shuo Chen
Zhaowei Chen
Jiajun Liang
Xiang Li
MLLM, VPVLM, VLM
121
2
0
10 Sep 2024
MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model
Zhen Yang
Jinhao Chen
Zhengxiao Du
Wenmeng Yu
Weihan Wang
Wenyi Hong
Zhihuan Jiang
Bin Xu
Yuxiao Dong
Jie Tang
VLM, LRM
84
11
0
10 Sep 2024
MemoVis: A GenAI-Powered Tool for Creating Companion Reference Images for 3D Design Feedback
Chen Chen
Cuong Nguyen
Thibault Groueix
Vladimir G. Kim
Nadir Weibel
DiffM
67
4
0
09 Sep 2024
Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding
Bram Willemsen
Gabriel Skantze
127
0
0
09 Sep 2024
Enhanced Generative Data Augmentation for Semantic Segmentation via Stronger Guidance
Quang-Huy Che
Duc-Tri Le
Vinh-Tiep Nguyen
D. Lam
DiffM
255
1
0
09 Sep 2024
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery
Mohammadmahdi Honarmand
Muhammad Abdullah Jamal
Omid Mohareri
151
2
0
07 Sep 2024
UNIT: Unifying Image and Text Recognition in One Vision Encoder
Yi Zhu
Yanpeng Zhou
Chunwei Wang
Yang Cao
Jianhua Han
Lu Hou
Hang Xu
ViT, VLM
114
4
0
06 Sep 2024
Large Language Models in Drug Discovery and Development: From Disease Mechanisms to Clinical Trials
Yizhen Zheng
Huan Yee Koh
M. Yang
Li Li
Lauren T. May
Geoffrey I. Webb
Shirui Pan
George Church
LM&MA
98
13
0
06 Sep 2024
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Anwen Hu
Haiyang Xu
Liang Zhang
Jiabo Ye
Ming Yan
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
104
37
0
05 Sep 2024
Improving agent performance in fluid environments by perceptual pretraining
Jin Zhang
Jianyang Xue
Bochao Cao
AI4CE
66
0
0
05 Sep 2024
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Yunze Man
Shuhong Zheng
Zhipeng Bao
M. Hebert
Liang-Yan Gui
Yu-Xiong Wang
151
23
0
05 Sep 2024
LinFusion: 1 GPU, 1 Minute, 16K Image
Songhua Liu
Weihao Yu
Zhenxiong Tan
Xinchao Wang
123
16
0
03 Sep 2024
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Haoran Wei
Chenglong Liu
Jinyue Chen
Jia Wang
Lingyu Kong
...
Liang Zhao
Jianjian Sun
Yuang Peng
Chunrui Han
Xiangyu Zhang
VLM
102
55
0
03 Sep 2024
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
Wenlong Huang
Chen Wang
Yongqian Li
Ruohan Zhang
Li Fei-Fei
134
115
0
03 Sep 2024