2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
132
28
0
04 Apr 2024
Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra
Darioush Kevian
U. Syed
Xing-ming Guo
Aaron J. Havens
Geir Dullerud
Peter M. Seiler
Lianhui Qin
Bin Hu
ELM
106
33
0
04 Apr 2024
WorDepth: Variational Language Prior for Monocular Depth Estimation
Ziyao Zeng
Daniel Wang
Fengyu Yang
Hyoungseob Park
Yangchao Wu
Stefano Soatto
Byung-Woo Hong
Dong Lao
Alex Wong
MDE
136
28
0
04 Apr 2024
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Deyao Zhu
Jian Ding
Mohamed Elhoseiny
VLM
99
79
0
04 Apr 2024
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng
Mingfei Han
Haoyu He
Xiaojun Chang
Bohan Zhuang
VLM
127
65
0
04 Apr 2024
Diverse and Tailored Image Generation for Zero-shot Multi-label Classification
Kai Zhang
Zhixiang Yuan
Tao Huang
VLM
79
4
0
04 Apr 2024
On the Scalability of Diffusion-based Text-to-Image Generation
Hao Li
Yang Zou
Ying Wang
Orchid Majumder
Yusheng Xie
R. Manmatha
Ashwin Swaminathan
Zhuowen Tu
Stefano Ermon
Stefano Soatto
96
23
0
03 Apr 2024
MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation
Petru-Daniel Tudosiu
Yongxin Yang
Shifeng Zhang
Fei Chen
Jingyu Sun
Gerasimos Lampouras
Ignacio Iacobacci
Sarah Parisot
95
12
0
03 Apr 2024
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
Haofan Wang
Matteo Spinelli
Qixun Wang
Xu Bai
Zekui Qin
Anthony Chen
DiffM
126
97
0
03 Apr 2024
Harnessing the Power of Large Vision Language Models for Synthetic Image Detection
Mamadou Keita
W. Hamidouche
Hassen Bougueffa
Abdenour Hadid
Abdelmalik Taleb-Ahmed
51
3
0
03 Apr 2024
VLRM: Vision-Language Models act as Reward Models for Image Captioning
Maksim Dzabraev
Alexander Kunitsyn
Andrei Ivaniuta
VLM
MLLM
73
3
0
02 Apr 2024
Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models
Kyuyoung Kim
Jongheon Jeong
Minyong An
Mohammad Ghavamzadeh
Krishnamurthy Dvijotham
Jinwoo Shin
Kimin Lee
EGVM
81
6
0
02 Apr 2024
MotionChain: Conversational Motion Controllers via Multimodal Prompts
Biao Jiang
Xin Chen
C. Zhang
Fukun Yin
Zhuoyuan Li
Gang Yu
Jiayuan Fan
VGen
LRM
96
11
0
02 Apr 2024
Predicting the Performance of Foundation Models via Agreement-on-the-Line
Aman Mehra
Rahul Saxena
Taeyoun Kim
Christina Baek
Zico Kolter
Aditi Raghunathan
UQCV
86
2
0
02 Apr 2024
OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation
Xiongwei Wu
Sicheng Yu
Ee-Peng Lim
Chong-Wah Ngo
VLM
73
2
0
01 Apr 2024
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
Paritosh Parmar
Eric Peh
Ruirui Chen
Ting En Lam
Yuhan Chen
Elston Tan
Basura Fernando
CML
93
7
0
01 Apr 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
107
42
0
01 Apr 2024
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Zhiqiu Lin
Deepak Pathak
Baiqi Li
Jiayao Li
Xide Xia
Graham Neubig
Pengchuan Zhang
Deva Ramanan
EGVM
150
171
0
01 Apr 2024
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Ruohong Zhang
Liangke Gui
Zhiqing Sun
Yihao Feng
Keyang Xu
...
Di Fu
Chunyuan Li
Alexander G. Hauptmann
Yonatan Bisk
Yiming Yang
MLLM
131
78
0
01 Apr 2024
Survey of Bias In Text-to-Image Generation: Definition, Evaluation, and Mitigation
Yixin Wan
Arjun Subramonian
Anaelia Ovalle
Zongyu Lin
Ashima Suvarna
Christina Chance
Hritik Bansal
Rebecca Pattichis
Kai-Wei Chang
EGVM
168
36
0
01 Apr 2024
AIGCOIQA2024: Perceptual Quality Assessment of AI Generated Omnidirectional Images
Liu Yang
Huiyu Duan
Long Teng
Yucheng Zhu
Xiaohong Liu
Menghan Hu
Xiongkuo Min
Guangtao Zhai
P. Callet
EGVM
75
15
0
01 Apr 2024
Harnessing Large Language Models for Training-free Video Anomaly Detection
Luca Zanella
Willi Menapace
Massimiliano Mancini
Yiming Wang
Elisa Ricci
VLM
111
30
0
01 Apr 2024
Continual Learning for Smart City: A Survey
Li Yang
Zhipeng Luo
Shi-sheng Zhang
Fei Teng
Tian-Jie Li
HAI
98
9
0
01 Apr 2024
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
105
6
0
01 Apr 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Rongjie Li
Yu Wu
Xuming He
MLLM
LRM
VLM
40
2
0
01 Apr 2024
Prompt Learning via Meta-Regularization
Jinyoung Park
Juyeon Ko
Hyunwoo J. Kim
VLM
VPVLM
98
19
0
01 Apr 2024
M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models
Fan Bai
Yuxin Du
Tiejun Huang
Max Q.-H. Meng
Bo Zhao
77
43
0
31 Mar 2024
Bayesian Exploration of Pre-trained Models for Low-shot Image Classification
Yibo Miao
Yu Lei
Feng Zhou
Zhijie Deng
VLM
UQCV
BDL
104
3
0
30 Mar 2024
ST-LLM: Large Language Models Are Effective Temporal Learners
Ruyang Liu
Chen Li
Haoran Tang
Yixiao Ge
Ying Shan
Ge Li
104
82
0
30 Mar 2024
Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training
Tongkun Su
Jun Li
Xi Zhang
Haibo Jin
Hao Chen
Qiong Wang
Faqin Lv
Baoliang Zhao
Yin Hu
71
0
0
30 Mar 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Yuhang Zang
...
Haodong Duan
Jiaqi Wang
Yu Qiao
Dahua Lin
Feng Zhao
VLM
137
303
0
29 Mar 2024
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
Jaisidh Singh
Ishaan Shrivastava
Mayank Vatsa
Richa Singh
Aparna Bharati
VLM
CoGe
86
20
0
29 Mar 2024
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
Barbara Toniella Corradini
Mustafa Shukor
Paul Couairon
Guillaume Couairon
Franco Scarselli
Matthieu Cord
DiffM
VLM
112
6
0
29 Mar 2024
FairCLIP: Harnessing Fairness in Vision-Language Learning
Yan Luo
Minfei Shi
Muhammad Osama Khan
Muhammad Muneeb Afzal
Hao Huang
...
Luo Song
Ava Kouhana
T. Elze
Yi Fang
Mengyu Wang
VLM
93
38
0
29 Mar 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Peng Gao
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Hongsheng Li
VLM
184
47
0
29 Mar 2024
Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving
Akshay Gopalkrishnan
Ross Greer
Mohan M. Trivedi
VLM
96
25
0
28 Mar 2024
A Review of Multi-Modal Large Language and Vision Models
Kilian Carolan
Laura Fennelly
Alan F. Smeaton
VLM
186
28
0
28 Mar 2024
Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
Jiaxing Chen
Yuxuan Liu
Dehu Li
Xiang An
Weimo Deng
Ziyong Feng
Yongle Zhao
Yin Xie
LRM
96
15
0
28 Mar 2024
Text Data-Centric Image Captioning with Interactive Prompts
Yiyu Wang
Hao Luo
Jungang Xu
Yingfei Sun
Fan Wang
VLM
76
0
0
28 Mar 2024
OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition
Jianqiang Wan
Sibo Song
Wenwen Yu
Yuliang Liu
Wenqing Cheng
Fei Huang
Xiang Bai
Cong Yao
Zhibo Yang
99
37
0
28 Mar 2024
RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents
Zeren Chen
Zhelun Shi
Xiaoya Lu
Lehan He
Sucheng Qian
...
Zhen-fei Yin
Jing Shao
Cewu Lu
75
6
0
28 Mar 2024
Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Yutong He
Alexander Robey
Naoki Murata
Yiding Jiang
J. Williams
George Pappas
Hamed Hassani
Yuki Mitsufuji
Ruslan Salakhutdinov
J. Zico Kolter
DiffM
151
5
0
28 Mar 2024
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Yanwei Li
Yuechen Zhang
Chengyao Wang
Zhisheng Zhong
Yixin Chen
Ruihang Chu
Shaoteng Liu
Jiaya Jia
VLM
MLLM
MoE
131
238
0
27 Mar 2024
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object
Chenshuang Zhang
Fei Pan
Junmo Kim
In So Kweon
Chengzhi Mao
85
11
1
27 Mar 2024
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
Inhwan Bae
Junoh Lee
Hae-Gon Jeon
105
22
0
27 Mar 2024
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Wonkyun Kim
Changin Choi
Wonseok Lee
Wonjong Rhee
VLM
110
56
0
27 Mar 2024
Generative Multi-modal Models are Good Class-Incremental Learners
Xusheng Cao
Haori Lu
Linlan Huang
Xialei Liu
Ming-Ming Cheng
CLL
95
15
0
27 Mar 2024
Multi-Modal Contrastive Learning for Online Clinical Time-Series Applications
Fabian Baldenweg
Manuel Burger
Gunnar Rätsch
Rita Kuznetsova
AI4TS
115
0
0
27 Mar 2024
Online Embedding Multi-Scale CLIP Features into 3D Maps
Shun Taguchi
Hideki Deguchi
50
0
0
27 Mar 2024
Garment3DGen: 3D Garment Stylization and Texture Generation
N. Sarafianos
Tuur Stuyck
Xiaoyu Xiang
Yilei Li
Jovan Popovic
Rakesh Ranjan
3DH
185
20
0
27 Mar 2024