Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,347 papers shown
Title
KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension
Jie Yang
Wang Zeng
Sheng Jin
Lumin Xu
Wentao Liu
Chen Qian
Ruimao Zhang
MLLM
133
3
0
04 Nov 2024
One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering
Deepayan Das
Davide Talon
Massimiliano Mancini
Yiming Wang
Elisa Ricci
128
0
0
04 Nov 2024
A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning
Fei Wang
Chong Chen
Hongyu Chen
Yugang Chang
Weiming Zeng
45
1
0
03 Nov 2024
DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion
Weicai Ye
Chenhao Ji
Zheng Chen
Junyao Gao
Xiaoshui Huang
Song-Hai Zhang
Wanli Ouyang
Tong He
Cairong Zhao
Guofeng Zhang
94
11
0
31 Oct 2024
Technical Report for Soccernet 2023 -- Dense Video Captioning
Zheng Ruan
Ruixuan Liu
Shimin Chen
Mengying Zhou
Xinquan Yang
Wei Li
Chong Chen
Wei Shen
28
0
0
31 Oct 2024
MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption
Ruixun Liu
Kaiyu Li
Jiayi Song
Dongwei Sun
Xiangyong Cao
VGen
87
1
0
31 Oct 2024
EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
Qinqian Lei
Bo Wang
Robby T. Tan
VLM
94
5
0
31 Oct 2024
Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding
Jinlong He
Pengfei Li
Gang Liu
Shenjun Zhong
86
3
0
31 Oct 2024
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Chen Huang
Skyler Seto
Samira Abnar
David Grangier
Navdeep Jaitly
J. Susskind
VLM
77
1
0
31 Oct 2024
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map
Xinyuan Chang
Maixuan Xue
Xinran Liu
Zheng Pan
Xing Wei
217
2
0
31 Oct 2024
On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
Xiufeng Song
Xiao Guo
Junxuan Zhang
Qirui Li
Lei Bai
Xiaoming Liu
Guangtao Zhai
Xiaohong Liu
VGen
DiffM
180
12
0
31 Oct 2024
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
Jie Zhu
Yukang Chen
Mingyu Ding
Ping Luo
Leye Wang
Jingdong Wang
DiffM
69
5
0
30 Oct 2024
MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering
Yizhen Luo
Zikun Nie
Massimo Hong
Suyuan Zhao
Hao Zhou
Zaiqing Nie
149
1
0
30 Oct 2024
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
Youcheng Huang
Fengbin Zhu
Jingkun Tang
Pan Zhou
Wenqiang Lei
Jiancheng Lv
Tat-Seng Chua
AAML
74
4
0
30 Oct 2024
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
J. Wu
Tsz Ting Chung
Kai Chen
Dit-Yan Yeung
LRM
VLM
131
4
0
30 Oct 2024
Multimodal Structure Preservation Learning
Chang Liu
Jieshi Chen
Lee H. Harrison
A. Dubrawski
68
0
0
29 Oct 2024
Natural Language Inference Improves Compositionality in Vision-Language Models
Paola Cascante-Bonilla
Yu Hou
Yang Trista Cao
Hal Daumé III
Rachel Rudinger
ReLM
CoGe
VLM
83
4
0
29 Oct 2024
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Bo Jiang
Shaoyu Chen
Bencheng Liao
Xingyu Zhang
Wei Yin
Qian Zhang
Chang Huang
Wen Liu
Xinyu Wang
VLM
MLLM
LRM
115
31
0
29 Oct 2024
Active Learning for Vision-Language Models
Bardia Safaei
Vishal M. Patel
VLM
119
7
0
29 Oct 2024
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
Wanhua Li
Zibin Meng
Jiawei Zhou
D. Wei
Chuang Gan
Hanspeter Pfister
LRM
VLM
58
6
0
28 Oct 2024
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines
Zhixin Zhang
Yiyuan Zhang
Xiaohan Ding
Xiangyu Yue
75
4
0
28 Oct 2024
Improving Generalization in Visual Reasoning via Self-Ensemble
Tien-Huy Nguyen
Quang-Khai Tran
Anh-Tuan Quang-Hoang
VLM
LRM
126
6
0
28 Oct 2024
Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments
Sangmim Song
S. Kodagoda
A. Gunatilake
Marc G. Carmichael
Karthick Thiyagarajan
Jodi Martin
LM&Ro
161
1
0
28 Oct 2024
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
Hanyu Wang
Saksham Suri
Yixuan Ren
Hao Chen
Abhinav Shrivastava
VGen
107
12
0
28 Oct 2024
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Yunhan Zhao
Xiang Zheng
Lin Luo
Yige Li
Xingjun Ma
Yu-Gang Jiang
VLM
AAML
113
7
0
28 Oct 2024
You Never Know: Quantization Induces Inconsistent Biases in Vision-Language Foundation Models
Eric Slyman
Anirudh Kanneganti
Sanghyun Hong
Stefan Lee
VLM
MQ
103
0
0
26 Oct 2024
EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering
Kai Cheng
Zhengyuan Li
Xingpeng Sun
Byung-Cheol Min
Amrit Singh Bedi
Aniket Bera
67
3
0
26 Oct 2024
MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation
Jialin Luo
Yuanzhi Wang
Ziqi Gu
Yide Qiu
Shuaizhen Yao
Fuyun Wang
Chunyan Xu
Wenhua Zhang
Dan Wang
Zhen Cui
DiffM
54
2
0
26 Oct 2024
Transferable Adversarial Attacks on SAM and Its Downstream Models
Song Xia
Wenhan Yang
Yi Yu
Xun Lin
Henghui Ding
Lingyu Duan
Xudong Jiang
AAML
SILM
121
6
0
26 Oct 2024
GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Junjie Li
Jianghong Ma
Xiaofeng Zhang
Yuhang Li
Jianyang Shi
128
1
0
26 Oct 2024
Improving Multimodal Large Language Models Using Continual Learning
Shikhar Srivastava
Md Yousuf Harun
Robik Shrestha
Christopher Kanan
KELM
VLM
CLL
75
1
0
25 Oct 2024
FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis
N. V. R. Chappa
P. Dobbs
Bhiksha Raj
Khoa Luu
91
3
0
25 Oct 2024
MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
Fengbin Zhu
Ziyang Liu
Xiang Yao Ng
Haohui Wu
Wenjie Wang
Fuli Feng
Chao Wang
Huanbo Luan
Tat-Seng Chua
VLM
102
3
0
25 Oct 2024
Frozen-DETR: Enhancing DETR with Image Understanding from Frozen Foundation Models
Shenghao Fu
Junkai Yan
Q. Yang
Xihan Wei
Xiaohua Xie
Wei-Shi Zheng
VLM
111
3
0
25 Oct 2024
NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction
Z. Gong
Guangyin Bao
Qi Zhang
Zhongwei Wan
Duoqian Miao
...
Changwei Wang
Rongtao Xu
Liang Hu
Ke Liu
Yu Zhang
DiffM
VGen
124
10
0
25 Oct 2024
Context-Based Visual-Language Place Recognition
Soojin Woo
Seong-Woo Kim
54
1
0
25 Oct 2024
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Xiangyu Zeng
Kunchang Li
Chenting Wang
Xinhao Li
Tianxiang Jiang
...
Zhengrong Yue
Yi Wang
Yali Wang
Yu Qiao
Limin Wang
MLLM
VLM
AI4TS
118
18
0
25 Oct 2024
Revealing and Reducing Gender Biases in Vision and Language Assistants (VLAs)
Leander Girrbach
Yiran Huang
Stephan Alaniz
Trevor Darrell
Zeynep Akata
VLM
142
2
0
25 Oct 2024
BIFRÖST: 3D-Aware Image compositing with Language Instructions
Lingxiao Li
Kaixiong Gong
Weihong Li
Xili Dai
Tao Chen
Xiaojun Yuan
Xiangyu Yue
103
2
0
24 Oct 2024
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Sara Ghaboura
Ahmed Heakl
Omkar Thawakar
Ali Alharthi
Ines Riahi
Abduljalil Saif
Jorma T. Laaksonen
Fahad Shahbaz Khan
Salman Khan
Rao Muhammad Anwer
84
3
0
24 Oct 2024
Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques
David Ortiz-Perez
Manuel Benavent-Lledo
José García Rodríguez
David Tomás
M. Flores Vizcaya-Moreno
69
1
0
24 Oct 2024
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning
Xiaoqiang Wang
Bang Liu
LLMAG
LM&Ro
LRM
118
12
0
24 Oct 2024
SegLLM: Multi-round Reasoning Segmentation
XuDong Wang
Shaolun Zhang
Shufan Li
Konstantinos Kallidromitis
Kehan Li
Yusuke Kato
Kazuki Kozuka
Trevor Darrell
VLM
LRM
115
2
0
24 Oct 2024
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Zijia Zhao
Longteng Guo
Tongtian Yue
Erdong Hu
Shuai Shao
Zehuan Yuan
Hua Huang
Qingbin Liu
48
3
0
24 Oct 2024
Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics
Jinghao Hu
Yuhe Zhang
Guohua Geng
Liuyuxin Yang
JiaRui Yan
Jingtao Cheng
YaDong Zhang
Kang Li
DiffM
45
0
0
24 Oct 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Liwen Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
181
7
0
24 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
97
0
0
23 Oct 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu
Yuhang Zang
Xiaoyi Dong
Pan Zhang
Yuhang Cao
Haodong Duan
Zeang Sheng
Yuanjun Xiong
Dahua Lin
Jiaqi Wang
105
12
0
23 Oct 2024
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin
Oh Hyun-Bin
JungMok Lee
Arda Senocak
Joon Son Chung
Tae-Hyun Oh
MLLM
VLM
160
8
0
23 Oct 2024
Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations
Cheng Lei
Jie Fan
Xinran Li
Tianzhu Xiang
Ao Li
Ce Zhu
Le Zhang
74
0
0
22 Oct 2024
Previous
1
2
3
...
20
21
22
...
45
46
47
Next