Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,352 papers shown
Title
Multimodal Label Relevance Ranking via Reinforcement Learning
Taian Guo
Taolin Zhang
Haoqian Wu
Hanjun Li
Ruizhi Qiao
Xing Sun
OffRL
50
0
0
18 Jul 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOS
LRM
178
5
0
18 Jul 2024
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Mingchen Zhuge
Jian Ding
Deyao Zhu
Jürgen Schmidhuber
Mohamed Elhoseiny
VLM
75
20
0
17 Jul 2024
E5-V: Universal Embeddings with Multimodal Large Language Models
Ting Jiang
Minghui Song
Zihan Zhang
Haizhen Huang
Weiwei Deng
Feng Sun
Qi Zhang
Deqing Wang
Fuzhen Zhuang
VLM
103
34
0
17 Jul 2024
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
Gengze Zhou
Yicong Hong
Zun Wang
Xin Eric Wang
Qi Wu
LM&Ro
96
30
0
17 Jul 2024
VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions
Seokha Moon
Hyun Woo
Hongbeen Park
Haeji Jung
R. Mahjourian
Hyung-Gun Chi
Hyerin Lim
Sangpil Kim
Jinkyu Kim
78
7
0
17 Jul 2024
ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map
Yilin Ye
Shishi Xiao
Xingchen Zeng
Wei Zeng
116
5
0
17 Jul 2024
DreamStory: Open-Domain Story Visualization by LLM-Guided Multi-Subject Consistent Diffusion
Huiguo He
Huan Yang
Zixi Tuo
Yuan Zhou
Qiuyue Wang
Yuhang Zhang
Zeyu Liu
Wenhao Huang
Hongyang Chao
Jian Yin
DiffM
VGen
200
17
0
17 Jul 2024
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
Ofir Abramovich
Niv Nayman
Sharon Fogel
I. Lavi
Ron Litman
Shahar Tsiper
Royee Tichauer
Srikar Appalaraju
Shai Mazor
R. Manmatha
VLM
109
3
0
17 Jul 2024
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Chen Ju
Haicheng Wang
Haozhe Cheng
Xu Chen
Zhonghua Zhai
Weilin Huang
Jinsong Lan
Shuai Xiao
Bo Zheng
VLM
98
6
0
16 Jul 2024
DiNO-Diffusion. Scaling Medical Diffusion via Self-Supervised Pre-Training
Guillermo Jiménez-Pérez
Pedro Osório
Josef Cersovsky
Javier Montalt-Tordera
Jens Hooge
Steffen Vogler
Sadegh Mohammadi
MedIm
100
2
0
16 Jul 2024
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
Pengxiang Li
Zhi Gao
Bofei Zhang
Tao Yuan
Yuwei Wu
Mehrtash Harandi
Yunde Jia
Song-Chun Zhu
Qing Li
VLM
MLLM
102
6
0
16 Jul 2024
How Control Information Influences Multilingual Text Image Generation and Editing?
Boqiang Zhang
Zuan Gao
Yadong Qu
Hongtao Xie
DiffM
95
5
0
16 Jul 2024
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
Shunqi Mao
Chaoyi Zhang
Hang Su
Hwanjun Song
Igor Shalyminov
Weidong Cai
76
1
0
16 Jul 2024
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models
Jinrui Zhang
Teng Wang
Haigang Zhang
Ping Lu
Feng Zheng
MLLM
LRM
VLM
90
4
0
16 Jul 2024
VISA: Reasoning Video Object Segmentation via Large Language Models
Cilin Yan
Haochen Wang
Shilin Yan
Xiaolong Jiang
Yao Hu
Guoliang Kang
Weidi Xie
E. Gavves
LRM
VLM
VOS
110
41
0
16 Jul 2024
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models
Zijian Zhou
Zheng Zhu
Holger Caesar
Miaojing Shi
VLM
100
3
0
15 Jul 2024
Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques
Rishika Bhagwatkar
Shravan Nayak
Reza Bayat
Alexis Roger
Daniel Z Kaplan
P. Bashivan
Irina Rish
AAML
VLM
86
2
0
15 Jul 2024
OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting
Penglei Gao
Kai Yao
Tiandi Ye
Steven Wang
Yuan Yao
Xiaofeng Wang
Mamba
70
3
0
15 Jul 2024
FabGPT: An Efficient Large Multimodal Model for Complex Wafer Defect Knowledge Queries
Yuqi Jiang
Xudong Lu
Qian Jin
Qi Sun
Hanming Wu
Cheng Zhuo
120
7
0
15 Jul 2024
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models
Yuchen Yang
Kwonjoon Lee
Behzad Dariush
Yinzhi Cao
Shao-Yuan Lo
LRM
99
19
0
14 Jul 2024
Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding
Ruihuang Li
Zhengqiang Zhang
Chenhang He
Zhiyuan Ma
Vishal M. Patel
Lei Zhang
3DV
VLM
98
6
0
13 Jul 2024
Machine Apophenia: The Kaleidoscopic Generation of Architectural Images
Alexey Tikhonov
Dmitry Sinyavin
97
0
0
12 Jul 2024
FD-SOS: Vision-Language Open-Set Detectors for Bone Fenestration and Dehiscence Detection from Intraoral Images
Marawan Elbatel
Keyuan Liu
Yanqi Yang
Xuelong Li
60
0
0
12 Jul 2024
PersonificationNet: Making customized subject act like a person
Tianchu Guo
Pengyu Li
Biao Wang
Xiansheng Hua
46
0
0
12 Jul 2024
Refusing Safe Prompts for Multi-modal Large Language Models
Zedian Shao
Hongbin Liu
Yuepeng Hu
Neil Zhenqiang Gong
MLLM
LRM
82
1
0
12 Jul 2024
Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort
Jeeyung Kim
Ze Wang
Qiang Qiu
81
2
0
12 Jul 2024
GOFA: A Generative One-For-All Model for Joint Graph Language Modeling
Lecheng Kong
Jiarui Feng
Hao Liu
Chengsong Huang
Jiaxin Huang
Yixin Chen
Muhan Zhang
AI4CE
157
13
0
12 Jul 2024
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Runhui Huang
Xinpeng Ding
Chunwei Wang
J. N. Han
Yulong Liu
Hengshuang Zhao
Hang Xu
Lu Hou
Wei Zhang
Xiaodan Liang
VLM
86
9
0
11 Jul 2024
SEED-Story: Multimodal Long Story Generation with Large Language Model
Shuai Yang
Yuying Ge
Yang Li
Yukang Chen
Yixiao Ge
Ying Shan
Yingcong Chen
VGen
DiffM
146
32
0
11 Jul 2024
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
Zhen Qin
Daoyuan Chen
Wenhao Zhang
Liuyi Yao
Yilun Huang
Bolin Ding
Yaliang Li
Shuiguang Deng
149
7
0
11 Jul 2024
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Xiaotong Li
Fan Zhang
Haiwen Diao
Yueze Wang
Xinlong Wang
Ling-yu Duan
VLM
121
32
0
11 Jul 2024
Enriching Information and Preserving Semantic Consistency in Expanding Curvilinear Object Segmentation Datasets
Qin Lei
Jiang Zhong
Qizhu Dai
DiffM
76
3
0
11 Jul 2024
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
Minghui Wu
Chenxu Zhao
Anyang Su
Donglin Di
Tianyu Fu
...
Min He
Ya Gao
Meng Ma
Kun Yan
Ping Wang
82
1
0
11 Jul 2024
Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement
Zijie Yue
Miaojing Shi
Hanli Wang
Shuai Ding
Qijun Chen
Shanlin Yang
113
0
0
11 Jul 2024
Robotic Control via Embodied Chain-of-Thought Reasoning
Michał Zawalski
William Chen
Karl Pertsch
Oier Mees
Chelsea Finn
Sergey Levine
LRM
LM&Ro
165
88
0
11 Jul 2024
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning
Haiwen Diao
Bo Wan
Xu Jia
Yunzhi Zhuge
Ying Zhang
Huchuan Lu
Long Chen
VLM
95
4
0
10 Jul 2024
A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends
Daizong Liu
Mingyu Yang
Xiaoye Qu
Pan Zhou
Yu Cheng
Wei Hu
ELM
AAML
108
33
0
10 Jul 2024
LEMoN: Label Error Detection using Multimodal Neighbors
Haoran Zhang
Aparna Balagopalan
Nassim Oufattole
Hyewon Jeong
Yan Wu
Jiacheng Zhu
Marzyeh Ghassemi
134
0
0
10 Jul 2024
CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model
Xiaoding Yuan
Shitao Tang
Kejie Li
Alan Yuille
Peng Wang
DiffM
87
3
0
09 Jul 2024
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Wenqi Zhang
Zhenglin Cheng
Yuanyu He
Mengna Wang
Yongliang Shen
...
Guiyang Hou
Mingqian He
Yanna Ma
Weiming Lu
Yueting Zhuang
SyDa
184
13
0
09 Jul 2024
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
Yang Liu
Weixing Chen
Yongjie Bai
Xiaodan Liang
Guanbin Li
Wen Gao
Liang Lin
LM&Ro
SyDa
AI4CE
161
70
0
09 Jul 2024
CEIA: CLIP-Based Event-Image Alignment for Open-World Event-Based Understanding
Wenhao Xu
Wenming Weng
Yueyi Zhang
Zhiwei Xiong
VLM
73
0
0
09 Jul 2024
VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving
Yibo Liu
Zheyuan Yang
Guile Wu
Y. Ren
Kejian Lin
Bingbing Liu
Yang Liu
Jinjun Shan
78
6
0
09 Jul 2024
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh
Cheng-Yu Hsieh
Shih-Ying Yeh
Louis Béthune
Hadi Pour Ansari
Pavan Kumar Anasosalu Vasu
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Marco Cuturi
150
5
0
09 Jul 2024
A Single Transformer for Scalable Vision-Language Modeling
Yangyi Chen
Xingyao Wang
Hao Peng
Heng Ji
LRM
107
17
0
08 Jul 2024
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
Xuan Ju
Yiming Gao
Zhaoyang Zhang
Ziyang Yuan
Xintao Wang
Ailing Zeng
Yu Xiong
Qiang Xu
Ying Shan
VGen
122
47
0
08 Jul 2024
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Orr Zohar
Xiaohan Wang
Yonatan Bitton
Idan Szpektor
Serena Yeung-Levy
VLM
LRM
104
8
0
08 Jul 2024
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
Yu Zeng
Vishal M. Patel
Haochen Wang
Xun Huang
Ting-Chun Wang
Xuan Li
Yogesh Balaji
DiffM
73
23
0
08 Jul 2024
Vision-Language Models under Cultural and Inclusive Considerations
Antonia Karamolegkou
Phillip Rust
Yong Cao
Ruixiang Cui
Anders Søgaard
Daniel Hershcovich
VLM
117
8
0
08 Jul 2024
Previous
1
2
3
...
29
30
31
...
46
47
48
Next