Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
Large Language Models for Video Surveillance Applications
Ulindu De Silva
Leon Fernando
Billy Lau Pik Lik
Zann Koh
Sam Conrad Joyce
Belinda Yuen
Chau Yuen
46
1
0
06 Jan 2025
Foundations of GenIR
Qingyao Ai
Jingtao Zhan
Yang Liu
126
0
0
06 Jan 2025
Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors
Minglin Chen
Longguang Wang
Sheng Ao
Ye Zhang
Kai Xu
Yulan Guo
DiffM
478
0
0
05 Jan 2025
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
Haicheng Wang
Zhemeng Yu
Gabriele Spadaro
Chen Ju
Victor Quétu
Enzo Tartaglione
Enzo Tartaglione
VLM
433
6
0
05 Jan 2025
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Rui Liu
Hongyu Yuan
Hong Li
128
0
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
363
59
0
03 Jan 2025
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
Lijie Tao
Han Zhang
Haizhao Jing
Yu Liu
Kelu Yao
Guoting Wei
Xizhe Xue
147
0
0
03 Jan 2025
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Ziyan Jiang
Rui Meng
Xinyi Yang
Semih Yavuz
Yingbo Zhou
Wenhu Chen
MLLM
VLM
199
29
0
03 Jan 2025
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Jianjie Luo
Jingwen Chen
Yehao Li
Yingwei Pan
Jianlin Feng
Hongyang Chao
Ting Yao
DiffM
VLM
139
0
0
03 Jan 2025
Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
Linhao Huang
Xue Jiang
Zhiqiang Wang
Wentao Mo
Xi Xiao
Bo Han
Yongjie Yin
Feng Zheng
AAML
158
4
0
02 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Jiaqi Wang
Hengshuang Zhao
229
16
0
02 Jan 2025
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Yuanpeng Tu
Hao Luo
Xi Chen
S. Ji
Xiang Bai
Hengshuang Zhao
DiffM
VGen
162
6
0
02 Jan 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang
Hang Zhang
Xin Li
Jiashuo Sun
Yongliang Shen
Weiming Lu
Deli Zhao
Yueting Zhuang
Lidong Bing
VLM
173
2
0
01 Jan 2025
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
Junxiao Xue
Quan Deng
Fei Yu
Yanhao Wang
Jun Wang
Yongqian Li
VLM
129
5
0
31 Dec 2024
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzdeh
VLM
134
26
0
31 Dec 2024
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
284
5
0
31 Dec 2024
ChartAdapter: Large Vision-Language Model for Chart Summarization
Peixin Xu
Yujuan Ding
Wenqi Fan
93
2
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw
LM&MA
LRM
149
30
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
190
42
0
31 Dec 2024
UniRestorer: Universal Image Restoration via Adaptively Estimating Image Degradation at Proper Granularity
Jingbo Lin
Zhilu Zhang
Wenbo Li
Renjing Pei
Hang Xu
Hongzhi Zhang
Wangmeng Zuo
116
1
0
28 Dec 2024
When SAM2 Meets Video Shadow and Mirror Detection
Leiping Jie
VLM
85
1
0
26 Dec 2024
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du
Quanjie Han
Chao Lv
Yi Liu
Yalin Sun
Hao Shu
Hongbo Shan
Maosong Sun
RALM
146
2
0
25 Dec 2024
PRISM: Efficient Long-Range Reasoning With Short-Context LLMs
Dulhan Jayalath
James Bradley Wendt
Nicholas Monath
Sandeep Tata
Beliz Gunel
CLL
LRM
74
0
0
25 Dec 2024
Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach
Jing Bi
Junjia Guo
Yunlong Tang
Lianggong Wen
Zhang Liu
Chenliang Xu
52
6
0
24 Dec 2024
Personalized Large Vision-Language Models
Chau Pham
Hoang Phan
David Doermann
Yunjie Tian
VLM
117
4
0
23 Dec 2024
Multimodal Preference Data Synthetic Alignment with Reward Model
Robert Wijaya
Ngoc-Bao Nguyen
Ngai-Man Cheung
MLLM
SyDa
133
4
0
23 Dec 2024
AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
Se Jin Park
Yeonju Kim
Hyeongseop Rha
Bella Godiva
Y. Ro
76
1
0
23 Dec 2024
Neural-MCRL: Neural Multimodal Contrastive Representation Learning for EEG-based Visual Decoding
Yueyang Li
Zijian Kang
Shengyu Gong
Wenhao Dong
Weiming Zeng
Hongjie Yan
W. Siok
Nizhuan Wang
119
2
0
23 Dec 2024
VidTwin: Video VAE with Decoupled Structure and Dynamics
Yuchi Wang
Junliang Guo
Xinyi Xie
Tianyu He
Xu Sun
Li Zhao
DRL
VGen
163
5
0
23 Dec 2024
CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models
Yeyuan Wang
D. Gao
Bin Li
Rujiao Long
Lei Yi
Xiaoyan Cai
Libin Yang
Jinxia Zhang
Shanqing Yu
Qi Xuan
107
1
0
22 Dec 2024
SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization
Tan-Hanh Pham
Hoang-Nam Le
Phu-Vinh Nguyen
Chris Ngo
Truong-Son Hy
AuLLM
LRM
144
1
0
21 Dec 2024
Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
Gautier Evennou
Antoine Chaffin
Vivien Chappelier
Ewa Kijak
DiffM
125
0
0
20 Dec 2024
Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
Zhiqiang Tang
Zihan Zhong
Tong He
Gerald Friedland
166
1
0
19 Dec 2024
Dataset Augmentation by Mixing Visual Concepts
Abdullah Al Rahat
Hemanth Venkateswara
DiffM
116
0
0
19 Dec 2024
Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation
J. Zhang
Li Zhang
Shijian Li
VLM
177
0
0
18 Dec 2024
A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future
Shilin Sun
Wenbin An
Feng Tian
Fang Nan
Qidong Liu
Jing Liu
N. Shah
Ping Chen
157
6
0
18 Dec 2024
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Cong Wei
Yujie Zhong
Haoxian Tan
Yingsen Zeng
Yong Liu
Zheng Zhao
Yujiu Yang
MLLM
VLM
VOS
152
3
0
18 Dec 2024
What makes a good metric? Evaluating automatic metrics for text-to-image consistency
Candace Ross
Melissa Hall
Adriana Romero Soriano
Adina Williams
165
4
0
18 Dec 2024
LaMI-GO: Latent Mixture Integration for Goal-Oriented Communications Achieving High Spectrum Efficiency
Achintha Wijesinghe
Suchinthaka Wanninayaka
Weiwei Wang
Yu-Chieh Chao
Songyang Zhang
Zhi Ding
132
1
0
18 Dec 2024
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang
Shusheng Yang
Anjali W. Gupta
Rilyn Han
Li Fei-Fei
Saining Xie
LRM
212
107
0
18 Dec 2024
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Le Yang
Ziwei Zheng
Boxu Chen
Zhengyu Zhao
Chenhao Lin
Chao Shen
VLM
317
7
0
18 Dec 2024
LLMs are Also Effective Embedding Models: An In-depth Overview
Chongyang Tao
Tao Shen
Shen Gao
Junshuo Zhang
Zhen Li
Zhengwei Tao
Shuai Ma
143
11
0
17 Dec 2024
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Wonseok Roh
Hwanhee Jung
Jong Wook Kim
Seanie Lee
Innfarn Yoo
Andreas Lugmayr
Seunggeun Chi
K. Ramani
Sangpil Kim
3DGS
171
2
0
17 Dec 2024
OmniPrism: Learning Disentangled Visual Concept for Image Generation
Yangyang Li
Daqing Liu
Wu Liu
Allen He
Xinchen Liu
Yongdong Zhang
Guoqing Jin
DiffM
CoGe
100
0
0
16 Dec 2024
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
Renqiu Xia
Mingxing Li
Hancheng Ye
Wenjie Wu
Hongbin Zhou
...
Zeang Sheng
Botian Shi
Tao Chen
Junchi Yan
Bo Zhang
196
10
0
16 Dec 2024
StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors
Xiaokun Sun
Zeyu Cai
Zhenyu Zhang
Ying Tai
Jian Yang
133
0
0
16 Dec 2024
CLIP-SR: Collaborative Linguistic and Image Processing for Super-Resolution
Bingwen Hu
Heng Liu
Zhedong Zheng
Ping Liu
SupR
263
0
0
16 Dec 2024
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
Ruijie Lu
Yixin Chen
Junfeng Ni
Baoxiong Jia
Yu Liu
Diwen Wan
Gang Zeng
Siyuan Huang
DiffM
235
4
0
16 Dec 2024
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Xinli Xu
Wenhang Ge
Dicong Qiu
ZhiFei Chen
Dongyu Yan
...
Haoyu Zhao
HanFeng Zhao
Shunsi Zhang
Junwei Liang
Ying-Cong Chen
3DGS
126
1
0
15 Dec 2024
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
Yuanmin Tang
Xiaoting Qin
Jing Zhang
Jing Yu
Gaopeng Gou
Gang Xiong
Qingwei Ling
Saravan Rajmohan
Dongmei Zhang
Qi Wu
LRM
113
1
0
15 Dec 2024
Previous
1
2
3
...
16
17
18
...
45
46
47
Next