2301.12597
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 786 papers shown
Title
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
Xiang Liu
Zhaoxiang Liu
Huan Hu
Zezhou Chen
Kohou Wang
Kai Wang
Shiguo Lian
43
1
0
10 Mar 2025
Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study
Yizheng Sun
Hao Li
Chang Xu
Hongpeng Zhou
Riza Batista-Navarro
Jingyuan Sun
62
0
0
09 Mar 2025
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Z. Chen
Chunwei Wang
Xiuwei Chen
Hang Xu
J. Han
Xiandan Liang
VLM
71
1
0
09 Mar 2025
CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models
Xiangyu Yin
Jiaxu Liu
Zhen Chen
Jinwei Hu
Yi Dong
Xiaowei Huang
Wenjie Ruan
AAML
50
0
0
08 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yi Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
148
0
0
08 Mar 2025
Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations
Meng Wang
Fan Wu
Yunchuan Qin
Ruihui Li
Zhuo Tang
KenLi Li
3DPC
99
0
0
08 Mar 2025
Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning
Yanjun Chen
Yirong Sun
Xinghao Chen
Jian Wang
Xiaoyu Shen
W. Li
Wei Zhang
3DV
LRM
64
1
0
08 Mar 2025
Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation
Yinuo Liu
Zenghui Yuan
Guiyao Tie
Jiawen Shi
Lichao Sun
Neil Zhenqiang Gong
46
1
0
08 Mar 2025
Treble Counterfactual VLMs: A Causal Approach to Hallucination
Li Li
Jiashu Qu
Yuxiao Zhou
Yuehan Qin
Tiankai Yang
Yue Zhao
95
2
0
08 Mar 2025
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions
Chan Hur
Jeong-hun Hong
Dong-hun Lee
Dabin Kang
Semin Myeong
Sang-hyo Park
Hyeyoung Park
58
0
0
07 Mar 2025
SpiritSight Agent: Advanced GUI Agent with One Look
Zhiyuan Huang
Ziming Cheng
Junting Pan
Zhaohui Hou
Mingjie Zhan
LLMAG
101
2
0
05 Mar 2025
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Huang Huang
Fangchen Liu
Letian Fu
Tingfan Wu
Mustafa Mukadam
Jitendra Malik
Ken Goldberg
Pieter Abbeel
LM&Ro
VLM
85
6
0
05 Mar 2025
WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation
Dujun Nie
Xianda Guo
Yiqun Duan
Ruijun Zhang
Long Chen
LM&Ro
162
2
0
04 Mar 2025
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
Zhipeng Huang
Shaobin Zhuang
Canmiao Fu
Binxin Yang
Ying Zhang
Chong Sun
Zhizheng Zhang
Yali Wang
Chen Li
Zheng-Jun Zha
DiffM
69
1
0
03 Mar 2025
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Haoxin Li
Boyang Li
CoGe
73
0
0
03 Mar 2025
Enhancing Retinal Vessel Segmentation Generalization via Layout-Aware Generative Modelling
Jonathan Fhima
Jan Van Eijgen
Lennert Beeckmans
Thomas Jacobs
Moti Freiman
Luis Filipe Nakayama
Ingeborg Stalmans
Chaim Baskin
Joachim A. Behar
MedIm
69
0
0
03 Mar 2025
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
Ziyang Zhang
Yang Yu
Yucheng Chen
Xulei Yang
S. Yeo
MedIm
56
1
0
02 Mar 2025
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di
Zhelun Yu
Guanghao Zhang
Haoyuan Li
Tao Zhong
Hao Cheng
Bolin Li
Wanggui He
Fangxun Shu
Hao Jiang
76
4
0
01 Mar 2025
Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems
Song Xia
Yi Yu
Wenhan Yang
Meiwen Ding
Zhuo Chen
Lingyu Duan
Alex C. Kot
Xudong Jiang
56
2
0
01 Mar 2025
SafeText: Safe Text-to-image Models via Aligning the Text Encoder
Yuepeng Hu
Zhengyuan Jiang
Neil Zhenqiang Gong
66
1
0
28 Feb 2025
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Yuheng Ji
Huajie Tan
Jiayu Shi
Xiaoshuai Hao
Yuan Zhang
...
Huaihai Lyu
Xiaolong Zheng
Jiaming Liu
Zhongyuan Wang
Shanghang Zhang
102
8
0
28 Feb 2025
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
Zhongyang Li
Ziyue Li
Dinesh Manocha
MoE
53
0
0
27 Feb 2025
MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge
Yuntao Du
Kailin Jiang
Zhi Gao
Chenrui Shi
Zilong Zheng
Siyuan Qi
Qing Li
KELM
73
2
0
27 Feb 2025
Knowledge Bridger: Towards Training-free Missing Multi-modality Completion
Guanzhou Ke
Shengfeng He
Xinyu Wang
Bo Wang
Guoqing Chao
Yuyao Zhang
Yi Xie
HeXing Su
68
0
0
27 Feb 2025
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
Zhaoyi Liu
Huan Zhang
AAML
86
0
0
25 Feb 2025
Parameter Efficient Merging for Multimodal Large Language Models with Complementary Parameter Adaptation
Fanhu Zeng
Haiyang Guo
Fei Zhu
Li Shen
Hao Tang
MoMe
54
1
0
24 Feb 2025
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
Taiyi Wang
Zhihao Wu
Jianheng Liu
Jianye Hao
Jun Wang
Kun Shao
OffRL
41
13
0
24 Feb 2025
VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation
Wei Zhao
Pengxiang Ding
Mengdi Zhang
Zhefei Gong
Shuanghao Bai
Han Zhao
Donglin Wang
93
6
0
24 Feb 2025
Tracking the Copyright of Large Vision-Language Models through Parameter Learning Adversarial Images
Yubo Wang
Jianting Tang
Chaohu Liu
Linli Xu
AAML
63
1
0
23 Feb 2025
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
174
0
0
21 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
82
8
0
21 Feb 2025
From Correctness to Comprehension: AI Agents for Personalized Error Diagnosis in Education
Yi-Fan Zhang
Hang Li
D. Song
Lichao Sun
Tianlong Xu
Qingsong Wen
LLMAG
LRM
93
2
0
20 Feb 2025
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model
Bo-Kai Ruan
Hao-Tang Tsui
Yung-Hui Li
Hong-Han Shuai
LM&Ro
86
4
0
20 Feb 2025
Animate Your Thoughts: Decoupled Reconstruction of Dynamic Natural Vision from Slow Brain Activity
Yizhuo Lu
Changde Du
Chong Wang
Xuanliu Zhu
Liuyun Jiang
Xujin Li
Huiguang He
VGen
125
4
0
20 Feb 2025
MatterChat: A Multi-Modal LLM for Material Science
Yingheng Tang
Wenbin Xu
Jie Cao
Jianzhu Ma
Weilu Gao
Steve Farrell
Benjamin Erichson
Michael W. Mahoney
Andy Nonaka
113
3
0
18 Feb 2025
Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards
Xinyi Yang
Liang Zeng
Heng Dong
Chao Yu
X. Wu
H. Yang
Yu Wang
Milind Tambe
Tonghan Wang
76
2
0
18 Feb 2025
Predicate Hierarchies Improve Few-Shot State Classification
Emily Jin
Joy Hsu
Jiajun Wu
OffRL
79
0
0
18 Feb 2025
Hedge Fund Portfolio Construction Using PolyModel Theory and iTransformer
Siqiao Zhao
Zhikang Dong
Zeyu Cao
Raphael Douady
57
6
0
17 Feb 2025
Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering
Zeqing Wang
Wentao Wan
Qiqing Lao
Runmeng Chen
Minjie Lang
Keze Wang
Liang Lin
LRM
103
3
0
17 Feb 2025
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Granite Vision Team
Leonid Karlinsky
Assaf Arbelle
Abraham Daniels
A. Nassar
...
Sriram Raghavan
T. Syeda-Mahmood
Peter W. J. Staar
Tal Drory
Rogerio Feris
VLM
AI4TS
114
0
0
14 Feb 2025
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Angelos Zavras
Dimitrios Michail
Xiao Xiang Zhu
Begüm Demir
Ioannis Papoutsis
VLM
86
0
0
13 Feb 2025
3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning
Guoqin Tang
Qingxuan Jia
Zeyuan Huang
Gang Chen
Ning Ji
Zhipeng Yao
66
0
0
13 Feb 2025
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Zhenxing Mi
Kuan-Chieh Jackson Wang
Guocheng Qian
Hanrong Ye
Runtao Liu
Sergey Tulyakov
Kfir Aberman
Dan Xu
LRM
47
0
0
12 Feb 2025
A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards
Shivansh Patel
Xinchen Yin
Wenlong Huang
Shubham Garg
H. Nayyeri
Li Fei-Fei
Svetlana Lazebnik
Yongqian Li
92
0
0
12 Feb 2025
Learning Human Skill Generators at Key-Step Levels
Yilu Wu
Chenhui Zhu
Shuai Wang
Hanlin Wang
Jing Wang
Zhaoxiang Zhang
Limin Wang
VGen
119
0
0
12 Feb 2025
Deciphering Functions of Neurons in Vision-Language Models
Jiaqi Xu
Cuiling Lan
Xuejin Chen
Yan Lu
VLM
100
0
0
10 Feb 2025
Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment
Minh-Quan Le
Gaurav Mittal
Tianjian Meng
A S M Iftekhar
Vishwas Suryanarayanan
Barun Patra
Dimitris Samaras
Mei Chen
DiffM
65
0
0
07 Feb 2025
Vision-Integrated LLMs for Autonomous Driving Assistance: Human Performance Comparison and Trust Evaluation
Namhee Kim
Woojin Park
46
0
0
06 Feb 2025
LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models
Tzu-Tao Chang
Shivaram Venkataraman
VLM
178
0
0
04 Feb 2025
UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping
Aashish Rai
Dilin Wang
Mihir Jain
N. Sarafianos
Arthur Chen
Srinath Sridhar
Aayush Prakash
3DGS
74
1
0
03 Feb 2025