2301.12597
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,347 papers shown
Title
VistaDream: Sampling multiview consistent images for single-view scene reconstruction
Haiping Wang
Yuan Liu
Ziwei Liu
Wenping Wang
Z. Dong
Bisheng Yang
110
13
0
22 Oct 2024
PerspectiveNet: Multi-View Perception for Dynamic Scene Understanding
Vinh Nguyen
3DV
31
0
0
22 Oct 2024
Foundation Models for Rapid Autonomy Validation
Alec Farid
Peter Schleede
Aaron Huang
Christoffer Heckman
101
0
0
22 Oct 2024
Progressive Compositionality in Text-to-Image Generative Models
Xu Han
Linghao Jin
Xiaofeng Liu
Paul Pu Liang
CoGe
149
4
0
22 Oct 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing
Qidong Huang
Xiaoyi Dong
Jiajie Lu
Pan Zhang
...
Yuhang Cao
Zeang Sheng
Jiaqi Wang
Feng Wu
Dahua Lin
VLM
133
46
0
22 Oct 2024
Frontiers in Intelligent Colonoscopy
Ge-Peng Ji
Jingyi Liu
Peng Xu
Nick Barnes
Fahad Shahbaz Khan
Salman Khan
Deng-Ping Fan
127
5
0
22 Oct 2024
Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions
Malte Prinzler
Egor Zakharov
V. Sklyarova
Berna Kabadayi
Justus Thies
DiffM
74
5
0
21 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
169
40
0
21 Oct 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Han Huang
Yuqi Huo
Zijia Zhao
Haoyu Lu
Shu Wu
Bin Wang
Qiang Liu
Weipeng Chen
Liang Wang
VLM
67
1
0
21 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
90
1
0
21 Oct 2024
Mitigating Object Hallucination via Concentric Causal Attention
Yun Xing
Yiheng Li
Ivan Laptev
Shijian Lu
108
23
0
21 Oct 2024
How to Build a Pre-trained Multimodal model for Simultaneously Chatting and Decision-making?
Zuojin Tang
Bin-Bin Hu
Chenyang Zhao
De Ma
Gang Pan
Bin Liu
119
1
0
21 Oct 2024
Reducing Hallucinations in Vision-Language Models via Latent Space Steering
Sheng Liu
Haotian Ye
Lei Xing
James Zou
VLM
LLMSV
167
9
0
21 Oct 2024
Test-time Adaptation for Cross-modal Retrieval with Query Shift
Haobin Li
Peng Hu
Qianjun Zhang
Xi Peng
Xiting Liu
Mouxing Yang
TTA
87
0
0
21 Oct 2024
Task-oriented Robotic Manipulation with Vision Language Models
Nurhan Bulus Guran
Hanchi Ren
Jingjing Deng
Xianghua Xie
119
4
0
21 Oct 2024
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Y. Cai
Jiangning Zhang
Haoyang He
Xinwei He
Ao Tong
Zhenye Gan
Chengjie Wang
Zhucun Xue
Yong-Jin Liu
X. Bai
VLM
96
6
0
21 Oct 2024
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S Ryoo
Honglu Zhou
Shrikant B. Kendre
Can Qin
Le Xue
...
Kanchana Ranasinghe
Ran Xu
Caiming Xiong
Juan Carlos Niebles
VGen
104
15
0
21 Oct 2024
IPO: Interpretable Prompt Optimization for Vision-Language Models
Yingjun Du
Wenfang Sun
Cees G. M. Snoek
VLM
72
3
0
20 Oct 2024
Scene Graph Generation with Role-Playing Large Language Models
Guikun Chen
Jin Li
Wenguan Wang
VLM
99
9
0
20 Oct 2024
A Survey of Hallucination in Large Visual Language Models
Wei Lan
Wenyi Chen
Qingfeng Chen
Shirui Pan
Huiyu Zhou
Yi-Lun Pan
LRM
92
6
0
20 Oct 2024
MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts
R. Teo
Tan M. Nguyen
MoE
94
3
0
18 Oct 2024
Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension
Yin Xie
Kaicheng Yang
Ninghua Yang
Weimo Deng
Xiangzi Dai
...
Yumeng Wang
Xiang An
Yongle Zhao
Ziyong Feng
Jiankang Deng
MLLM
VLM
72
1
0
18 Oct 2024
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
Muhe Ding
Yang Ma
Pengda Qin
Jianlong Wu
Yuhong Li
Liqiang Nie
78
1
0
18 Oct 2024
Leveraging Large Language Models for Enhancing Public Transit Services
Jiahao Wang
Amer Shalaby
51
2
0
18 Oct 2024
Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
Chenhang Cui
An Zhang
Yiyang Zhou
Zhaorun Chen
Gelei Deng
Huaxiu Yao
Tat-Seng Chua
210
8
0
18 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAML
CoGe
VLM
234
31
0
18 Oct 2024
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers
Yuxin Wen
Qingqing Cao
Qichen Fu
Sachin Mehta
Mahyar Najibi
VLM
127
5
0
17 Oct 2024
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Chenhao Zhang
Xi Feng
Yuelin Bai
Xinrun Du
Jinchang Hou
...
Min Yang
Wenhao Huang
Chenghua Lin
Ge Zhang
Shiwen Ni
ELM
VLM
74
6
0
17 Oct 2024
Exploring the Design Space of Visual Context Representation in Video MLLMs
Yifan Du
Yuqi Huo
K. Zhou
Zijia Zhao
Haoyu Lu
Han Huang
Wayne Xin Zhao
Bin Wang
Weipeng Chen
Ji-Rong Wen
49
2
0
17 Oct 2024
H2OVL-Mississippi Vision Language Models Technical Report
Shaikat Galib
Shanshan Wang
Guanshuo Xu
Pascal Pfeiffer
Ryan Chesler
Mark Landry
Sri Satish Ambati
MLLM
VLM
49
4
0
17 Oct 2024
Roadmap towards Superhuman Speech Understanding using Large Language Models
Fan Bu
Yuhao Zhang
Xiang Wang
Benyou Wang
Qiang Liu
Haoyang Li
LM&MA
ELM
AuLLM
395
1
0
17 Oct 2024
GraspDiffusion: Synthesizing Realistic Whole-body Hand-Object Interaction
Patrick Kwon
Hanbyul Joo
61
4
0
17 Oct 2024
3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation
Dewei Zhou
Ji Xie
Zongxin Yang
Yi Yang
DiffM
135
8
0
16 Oct 2024
Adaptive Prompt Learning with SAM for Few-shot Scanning Probe Microscope Image Segmentation
Yao Shen
Ziwei Wei
Chunmeng Liu
Shuming Wei
Qi Zhao
Kaiyang Zeng
Guangyao Li
VLM
65
0
0
16 Oct 2024
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu
Linchao Zhu
Yi Yang
141
5
0
16 Oct 2024
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Yiwei Guo
Shaobin Zhuang
Kunchang Li
Yu Qiao
Yali Wang
VLM
CLIP
137
1
0
16 Oct 2024
OMCAT: Omni Context Aware Transformer
Arushi Goel
Karan Sapra
Matthieu Le
Rafael Valle
Andrew Tao
Bryan Catanzaro
MLLM
VLM
79
1
0
15 Oct 2024
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
Yue Cao
Yangzhou Liu
Zhe Chen
Guangchen Shi
Wenhai Wang
Danhuai Zhao
Tong Lu
114
9
0
15 Oct 2024
It's Just Another Day: Unique Video Captioning by Discriminative Prompting
Toby Perrett
Tengda Han
Dima Damen
Andrew Zisserman
88
3
0
15 Oct 2024
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Sijie Cheng
Kechen Fang
Yangyang Yu
Sicheng Zhou
Yangqiu Song
Ye Tian
Tingguang Li
Lei Han
Yang Liu
111
10
0
15 Oct 2024
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
Bin Shan
Xiang Fei
Wei Shi
An-Lan Wang
Guozhi Tang
Lei Liao
Jingqun Tang
Xiang Bai
Can Huang
VLM
89
7
0
15 Oct 2024
Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs
Shuo Li
Tao Ji
Xiaoran Fan
Linsheng Lu
L. Yang
...
Yansen Wang
Xiaohui Zhao
Tao Gui
Qi Zhang
Xuanjing Huang
80
1
0
15 Oct 2024
Automatically Generating Visual Hallucination Test Cases for Multimodal Large Language Models
Zhongye Liu
Hongbin Liu
Yuepeng Hu
Zedian Shao
Neil Zhenqiang Gong
VLM
MLLM
51
0
0
15 Oct 2024
DreamSteerer: Enhancing Source Image Conditioned Editability using Personalized Diffusion Models
Zhengyang Yu
Zhaoyuan Yang
Jing Zhang
DiffM
96
3
0
15 Oct 2024
PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model
Shang-Ching Liu
Van-Nhiem Tran
Wenkai Chen
Wei-Lun Cheng
Yen-Lin Huang
I-Bin Liao
Yung-Hui Li
Jianwei Zhang
85
0
0
15 Oct 2024
A Simple Approach to Unifying Diffusion-based Conditional Generation
Xirui Li
Charles Herrmann
Kelvin C.K. Chan
Yinxiao Li
Deqing Sun
Chao Ma
Ming-Hsuan Yang
DiffM
VLM
146
1
0
15 Oct 2024
MEV Capture Through Time-Advantaged Arbitrage
Robin Fritsch
Maria Ines Silva
A. Mamageishvili
Benjamin Livshits
E. Felten
108
3
0
14 Oct 2024
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
Kai Han
Jianyuan Guo
Yehui Tang
W. He
Enhua Wu
Yunhe Wang
MLLM
VLM
65
8
0
14 Oct 2024
MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages
Shubhi Bansal
Nishit Sushil Singh
Shahid Shafi Dar
Nagendra Kumar
81
1
0
14 Oct 2024
PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation
Kai Zhang
Pengzhen Ren
Bingqian Lin
Junfan Lin
Shikui Ma
Hang Xu
Xiaodan Liang
61
2
0
14 Oct 2024