2301.12597
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,352 papers shown
Title
Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment
Wenliang Zhong
Wenyi Wu
Qi Li
Rob Barton
Boxin Du
Shioulin Sam
Karim Bouyarmane
Ismail B. Tutar
Junzhou Huang
92
3
0
05 Jun 2024
GraphAlign: Pretraining One Graph Neural Network on Multiple Graphs via Feature Alignment
Zhenyu Hou
Haozhan Li
Yukuo Cen
Jie Tang
Yuxiao Dong
95
8
0
05 Jun 2024
Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Adapter
Peng-Fei Xing
Ning Wang
Jianbo Ouyang
Zechao Li
DiffM
72
1
0
05 Jun 2024
A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
Zicheng Zhang
H. Wu
Chunyi Li
Yingjie Zhou
Wei Sun
Xiongkuo Min
Zijian Chen
Xiaohong Liu
Weisi Lin
Guangtao Zhai
EGVM
148
18
0
05 Jun 2024
Item-Language Model for Conversational Recommendation
Li Yang
Anushya Subbiah
Hardik Patel
Judith Yue Li
Yanwei Song
Reza Mirghaderi
Vikram Aggarwal
Qifan Wang
KELM
94
5
0
05 Jun 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
Alex Jinpeng Wang
Linjie Li
Yiqi Lin
Min Li
Lijuan Wang
Mike Zheng Shou
VLM
101
5
0
04 Jun 2024
V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation
Cong Wang
Kuan Tian
Jun Zhang
Yonghang Guan
Feng Luo
Fei Shen
Zhiwei Jiang
Qing Gu
Xiao Han
Wei Yang
129
45
0
04 Jun 2024
Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts
Haodong Hong
Sen Wang
Zi Huang
Qi Wu
Jiajun Liu
109
4
0
04 Jun 2024
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Masahiro Yasuda
Shunsuke Tsubaki
Keisuke Imoto
VLM
102
7
0
04 Jun 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim
Hyunjun Kim
Yeonju Kim
Yong Man Ro
MLLM
117
16
0
04 Jun 2024
Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
Inkyu Shin
Qihang Yu
Xiaohui Shen
In So Kweon
Kuk-Jin Yoon
Liang-Chieh Chen
VGen
DiffM
127
1
0
04 Jun 2024
Parrot: Multilingual Visual Instruction Tuning
Hai-Long Sun
Da-Wei Zhou
Yangfu Li
Shiyin Lu
Chao Yi
...
Zhao Xu
Weihua Luo
Kaifu Zhang
De-Chuan Zhan
Han-Jia Ye
MLLM
163
12
0
04 Jun 2024
L-MAGIC: Language Model Assisted Generation of Images with Coherence
Zhipeng Cai
Matthias Mueller
R. Birkl
Diana Wofk
Shaoyen Tseng
JunDa Cheng
Gabriela Ben-Melech Stan
Vasudev Lal
Michael Paulitsch
DiffM
MLLM
85
6
0
03 Jun 2024
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model
An-Chieh Cheng
Hongxu Yin
Yang Fu
Qiushan Guo
Ruihan Yang
Jan Kautz
Xiaolong Wang
Sifei Liu
LRM
120
75
0
03 Jun 2024
ELSA: Evaluating Localization of Social Activities in Urban Streets
Maryam Hosseini
Marco Cipriano
Sedigheh Eslami
Daniel Hodczak
Liu Liu
Andres Sevtsuk
Gerard de Melo
67
0
0
03 Jun 2024
Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation
Enhui Ma
Lijun Zhou
Tao Tang
Zhan Zhang
Dong Han
...
Peng Jia
Xianpeng Lang
Haiyang Sun
Di Lin
Kaicheng Yu
VGen
114
28
0
03 Jun 2024
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
Weichao Zhao
Hao Feng
Qi Liu
Jingqun Tang
Shubo Wei
...
Lei Liao
Yongjie Ye
Hao Liu
Houqiang Li
Can Huang
LMTD
100
24
0
03 Jun 2024
Towards Practical Single-shot Motion Synthesis
Konstantinos Roditakis
Spyridon Thermos
N. Zioulis
VGen
121
0
0
03 Jun 2024
MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4
Vahid Azizi
Fatemeh Koochaki
VLM
112
0
0
03 Jun 2024
Multimodal Deep Learning for Low-Resource Settings: A Vector Embedding Alignment Approach for Healthcare Applications
David Restrepo
Chenwei Wu
Sebastián Andrés Cajas
Luis Filipe Nakayama
Leo Anthony Celi
Diego M. Lopez
66
3
0
02 Jun 2024
Image Captioning via Dynamic Path Customization
Yiwei Ma
Jiayi Ji
Xiaoshuai Sun
Yiyi Zhou
Xiaopeng Hong
Yongjian Wu
Rongrong Ji
81
1
0
01 Jun 2024
Artemis: Towards Referential Understanding in Complex Videos
Jihao Qiu
Yuan Zhang
Xi Tang
Lingxi Xie
Tianren Ma
Pengyu Yan
David Doermann
Qixiang Ye
Yunjie Tian
VLM
VGen
90
10
0
01 Jun 2024
Query2CAD: Generating CAD models using natural language queries
Akshay Badagabettu
Sai Sravan Yarlagadda
A. Farimani
81
15
0
31 May 2024
Empowering Visual Creativity: A Vision-Language Assistant to Image Editing Recommendations
Tiancheng Shen
Jun Hao Liew
Long Mai
Lu Qi
Jiashi Feng
Jiaya Jia
DiffM
60
2
0
31 May 2024
StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond
Pengyuan Lyu
Yulin Li
Hao Zhou
Weihong Ma
Xingyu Wan
...
Liang Wu
Chengquan Zhang
Kun Yao
Errui Ding
Jingdong Wang
76
7
0
31 May 2024
Hard Cases Detection in Motion Prediction by Vision-Language Foundation Models
Yi Yang
Qingwen Zhang
Kei Ikemura
Nazre Batool
John Folkesson
VLM
77
2
0
31 May 2024
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models
Linli Yao
Lei Li
Shuhuai Ren
Lean Wang
Yuanxin Liu
Xu Sun
Lu Hou
76
34
0
31 May 2024
MeshXL: Neural Coordinate Field for Generative 3D Foundation Models
Sijin Chen
Xin Chen
Anqi Pang
Xianfang Zeng
Wei Cheng
...
C. Zhang
Jingyi Yu
Gang Yu
Bin-Bin Fu
Tao Chen
AI4CE
135
43
0
31 May 2024
Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
Cheng Tan
Jingxuan Wei
Linzhuang Sun
Zhangyang Gao
Siyuan Li
Bihui Yu
Ruifeng Guo
Stan Z. Li
ReLM
LRM
3DV
115
7
0
31 May 2024
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Shiyin Lu
Yang Li
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
Han-Jia Ye
VLM
MLLM
144
55
0
31 May 2024
InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding
Huaxiang Zhang
Yaojia Mu
Guo-Niu Zhu
Zhongxue Gan
83
2
0
31 May 2024
Joint Embeddings for Graph Instruction Tuning
Vlad Argatu
Aaron Haag
Oliver Lohse
93
0
0
31 May 2024
Information Theoretic Text-to-Image Alignment
Chao Wang
Giulio Franzese
A. Finamore
Massimo Gallo
Pietro Michiardi
176
0
0
31 May 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
...
Xiawu Zheng
Enhong Chen
Caifeng Shan
Xing Sun
VLM
MLLM
185
421
0
31 May 2024
Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images
Krishnakant Singh
Thanush Navaratnam
Jannik Holmer
Simone Schaub-Meyer
Stefan Roth
DiffM
99
21
0
30 May 2024
Visual Perception by Large Language Model's Weights
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
69
8
0
30 May 2024
VividDream: Generating 3D Scene with Ambient Dynamics
Yao-Chih Lee
Yi-Ting Chen
Andrew Wang
Ting-Hsuan Liao
Brandon Y. Feng
Jia-Bin Huang
VGen
82
12
0
30 May 2024
LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild
Zhiqiang Wang
Dejia Xu
Rana Muhammad Shahroz Khan
Yanbin Lin
Zhiwen Fan
Xingquan Zhu
77
4
0
30 May 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
121
18
0
30 May 2024
NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models
Kai Wu
Boyuan Jiang
Zhengkai Jiang
Qingdong He
Donghao Luo
Shengzhi Wang
Qingwen Liu
Chengjie Wang
VLM
MLLM
115
4
0
30 May 2024
RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection
Fangyi Chen
Han Zhang
Zhantao Yang
Hao Chen
Kai Hu
Marios Savvides
ObjD
VLM
89
5
0
30 May 2024
Instruction-Guided Visual Masking
Jinliang Zheng
Jianxiong Li
Si Cheng
Yinan Zheng
Jiaming Li
Jihao Liu
Yu Liu
Jingjing Liu
Xianyuan Zhan
138
7
0
30 May 2024
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng
Pan Lu
Fan Yin
Ziniu Hu
Sheng Shen
James Zou
Kai-Wei Chang
Wei Wang
SyDa
VLM
LRM
100
46
0
30 May 2024
Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases
Zian Su
Xiangzhe Xu
Ziyang Huang
Kaiyuan Zhang
Xiangyu Zhang
86
8
0
30 May 2024
Don't drop your samples! Coherence-aware training benefits Conditional diffusion
Nicolas Dufour
Victor Besnier
Vicky Kalogeiton
David Picard
DiffM
135
2
0
30 May 2024
Transfer Attack for Bad and Good: Explain and Boost Adversarial Transferability across Multimodal Large Language Models
Hao-Ran Cheng
Erjia Xiao
Jiayan Yang
Jinhao Duan
Yichi Wang
...
Qiang Zhang
Le Yang
Kaidi Xu
Jindong Gu
Renjing Xu
AAML
142
10
0
30 May 2024
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
Qianqi Yan
Xuehai He
Xiang Yue
Xin Eric Wang
LM&MA
139
12
0
30 May 2024
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Alex Fang
Wenjing Zhou
Kevin Jamieson
S. Du
104
9
0
29 May 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Ming-Yu Liu
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM
VLM
86
35
0
29 May 2024
Video Anomaly Detection in 10 Years: A Survey and Outlook
Moshira Abdalla
Sajid Javed
Muaz Al Radi
Anwaar Ulhaq
Naoufel Werghi
93
5
0
29 May 2024