2301.12597
Cited By
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,338 papers shown
Title
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Yihe Tang
Wenlong Huang
Yingke Wang
Chengshu Li
Roy Yuan
Ruohan Zhang
Jiajun Wu
Li Fei-Fei
50
0
0
10 Jun 2025
Multimodal Representation Alignment for Cross-modal Information Retrieval
Fan Xu
Luis A. Leiva
19
0
0
10 Jun 2025
SensorLM: Learning the Language of Wearable Sensors
Yuwei Zhang
Kumar Ayush
Siyuan Qiao
A. Heydari
Girish Narayanswamy
...
Shwetak N. Patel
Cecilia Mascolo
Xin Liu
Daniel J. McDuff
Yuzhe Yang
56
0
0
10 Jun 2025
Revolutionizing Clinical Trials: A Manifesto for AI-Driven Transformation
M. Schaar
Richard W. Peck
E. McKinney
Jim Weatherall
Stuart Bailey
...
Rafik Salama
Christina Gunther
Francesca Frau
Antoine Pugeat
Ramon Hernandez
MedIm
69
6
0
10 Jun 2025
Segment Any Architectural Facades (SAAF): An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance
Peilin Li
Jun Yin
Jing Zhong
Ran Luo
Pengyu Zeng
Miao Zhang
13
0
0
09 Jun 2025
LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
Jin Huang
Yuchao Jin
Le An
Josh Park
VLM
14
0
0
09 Jun 2025
SpatialLM: Training Large Language Models for Structured Indoor Modeling
Yongsen Mao
Junhao Zhong
Chuan Fang
Jia Zheng
Rui Tang
Hao Zhu
Ping Tan
Zihan Zhou
3DV
24
1
0
09 Jun 2025
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
Boyu Chen
Siran Chen
Kunchang Li
Qinglin Xu
Yu Qiao
Yali Wang
VOS
27
0
0
09 Jun 2025
HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains
Shijie Wang
Yilun Zhang
Zeyu Lai
Dexing Kong
22
0
0
09 Jun 2025
Open World Scene Graph Generation using Vision Language Models
Amartya Dutta
Kazi Sajeed Mehrab
Medha Sawhney
Abhilash Neog
Mridul Khurana
...
Aanish Pradhan
M. Maruf
Ismini Lourentzou
Arka Daw
Anuj Karpatne
VLM
18
0
0
09 Jun 2025
Language-Vision Planner and Executor for Text-to-Visual Reasoning
Yichang Xu
Gaowen Liu
Ramana Rao Kompella
Sihao Hu
Tiansheng Huang
Fatih Ilhan
Selim Furkan Tekin
Zachary Yahn
Ling Liu
LRM
VLM
23
0
0
09 Jun 2025
Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation
H. Kim
Donghyun Kim
Suhyun Kim
DiffM
31
1
0
09 Jun 2025
Dual-Priv Pruning: Efficient Differential Private Fine-Tuning in Multimodal Large Language Models
Qianshan Wei
Jiaqi Li
Zihan You
Yi Zhan
Kecen Li
...
Yi Yu
Bin Cao
Yiwen Xu
Yang Liu
Guilin Qi
AAML
VLM
21
0
0
08 Jun 2025
Zero Shot Composed Image Retrieval
Santhosh Kakarla
Gautama Shastry Bulusu Venkata
24
0
0
07 Jun 2025
Training-Free Identity Preservation in Stylized Image Generation Using Diffusion Models
Mohammad Ali Rezaei
Helia Hajikazem
Saeed Khanehgir
Mahdi Javanmardi
DiffM
26
0
0
07 Jun 2025
Technical Report for Egocentric Mistake Detection for the HoloAssist Challenge
Constantin Patsch
Marsil Zakour
Yuankai Wu
Eckehard G. Steinbach
43
0
0
06 Jun 2025
Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models
Hugues Thomas
Chen Chen
Jian Zhang
42
0
0
06 Jun 2025
Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery
Sajjad Abdoli
Freeman Lewin
Gediminas Vasiliauskas
Fabian Schonholz
EGVM
AI4TS
VLM
61
0
0
06 Jun 2025
ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On
Jinjuan Wang
Wenzhang Sun
Ming Li
Y. Zheng
Fanyao Li
Zhulin Tao
Donglin Di
Hao Li
Wei Chen
Xianglin Huang
VGen
AI4TS
60
0
0
06 Jun 2025
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Jiahui Wang
Z. Liu
Yongming Rao
Jiwen Lu
VLM
LRM
166
0
0
05 Jun 2025
Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline
Yuzhi Huang
Chenxin Li
H. Zhang
Zixu Lin
Yunlong Lin
...
Xinyu Liu
Jiechao Gao
Yue Huang
Xinghao Ding
Yixuan Yuan
117
0
0
05 Jun 2025
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding
Yan Shu
Hangui Lin
Yexin Liu
Yan Zhang
Gangyan Zeng
Yan Li
Yu Zhou
Ser-Nam Lim
Harry Yang
N. Sebe
MLLM
VLM
70
0
0
05 Jun 2025
Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model
Zelu Qi
Ping Shi
C. Zhang
Shuqi Wang
F. Zhao
Da Pan
Zefeng Ying
EGVM
VGen
139
0
0
05 Jun 2025
Aligning Multimodal Representations through an Information Bottleneck
Antonio Almudévar
José Miguel Hernández-Lobato
Sameer Khurana
R. Marxer
Alfonso Ortega
SSL
112
0
0
05 Jun 2025
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs
Haoyuan Li
Yanpeng Zhou
Yufei Gao
Tao Tang
J. N. Han
Yujie Yuan
Dave Zhenyu Chen
Jiawang Bian
Hang Xu
Xiaodan Liang
119
0
0
05 Jun 2025
SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents
Alexander Huang-Menders
Xinhang Liu
Andy Xu
Yuyao Zhang
Chi-Keung Tang
Yu-Wing Tai
DiffM
116
0
0
05 Jun 2025
Multimodal Tabular Reasoning with Privileged Structured Information
Jun-Peng Jiang
Yu Xia
Hai-Long Sun
Shiyin Lu
Qing-Guo Chen
Weihua Luo
Kaifu Zhang
De-Chuan Zhan
Han-Jia Ye
LMTD
LRM
96
0
0
04 Jun 2025
Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation
Chaehun Shin
Jooyoung Choi
Johan Barthelemy
Jungbeom Lee
Sungroh Yoon
DiffM
83
0
0
04 Jun 2025
Zero-Shot Temporal Interaction Localization for Egocentric Videos
Erhang Zhang
Junyi Ma
Yin-Dong Zheng
Yixuan Zhou
Hesheng Wang
94
0
0
04 Jun 2025
SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models
Arnab Debnath
Gregory J. Stein
Jana Kosecka
LM&Ro
89
0
0
04 Jun 2025
Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
77
0
0
04 Jun 2025
EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models
Mingzhe Li
Gehao Zhang
Zhenting Wang
Shiqing Ma
Siqi Pan
Richard Cartwright
Juan Zhai
DiffM
54
0
0
03 Jun 2025
Attacking Attention of Foundation Models Disrupts Downstream Tasks
Hondamunige Prasanna Silva
Federico Becattini
Lorenzo Seidenari
AAML
27
0
0
03 Jun 2025
MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection
Juntong Li
Lingwei Dang
Yukun Su
Yun Hao
Qingxin Xiao
Yongwei Nie
Qingyao Wu
64
0
0
03 Jun 2025
SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence
Zhitao Zeng
Zhu Zhuo
Xiaojun Jia
Erli Zhang
Junde Wu
...
Xiaochun Cao
Yutong Ban
Qi Dou
Yang Liu
Yueming Jin
VLM
53
0
0
03 Jun 2025
Self-Supervised Spatial Correspondence Across Modalities
Ayush Shrivastava
Andrew Owens
SSL
49
0
0
03 Jun 2025
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
Jenny Schmalfuss
Nadine Chang
Vibashan VS
Maying Shen
Andrés Bruhn
Jose M. Alvarez
VLM
19
0
0
03 Jun 2025
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models
Yan Shu
Bin Ren
Zhitong Xiong
Danda Pani Paudel
Luc Van Gool
Begüm Demir
N. Sebe
Paolo Rota
VLM
61
0
0
02 Jun 2025
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Zijian Wu
Jinjie Ni
Xiangyan Liu
Zichen Liu
Hang Yan
Michael Shieh
OffRL
ReLM
LRM
39
0
0
02 Jun 2025
Is Extending Modality The Right Path Towards Omni-Modality?
Tinghui Zhu
Kai Zhang
Muhao Chen
Yu Su
VLM
54
0
0
02 Jun 2025
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs
Wayner Barrios
Andrés Villa
Juan Carlos León Alcázar
SouYoung Jin
Bernard Ghanem
57
0
0
02 Jun 2025
Data Pruning by Information Maximization
Haoru Tan
Sitong Wu
Wei Huang
Shizhen Zhao
Xiaojuan Qi
61
1
0
02 Jun 2025
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Hyojin Bahng
Caroline Chan
F. Durand
Phillip Isola
EGVM
29
0
0
02 Jun 2025
MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping
Xiaojun Shan
Qi Cao
Xing Han
Haofei Yu
Paul Liang
51
0
0
02 Jun 2025
Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360-Degree Firefighting Videos
Aditi Tiwari
Farzaneh Masoud
Dac Trong Nguyen
Jill Kraft
Heng Ji
Klara Nahrstedt
36
0
0
02 Jun 2025
Unraveling Spatio-Temporal Foundation Models via the Pipeline Lens: A Comprehensive Review
Yuchen Fang
Hao Miao
Yuxuan Liang
Liwei Deng
Yue Cui
...
Yan Zhao
T. Pedersen
Christian S. Jensen
Xiaofang Zhou
Kai Zheng
AI4TS
AI4CE
70
0
0
02 Jun 2025
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective
Lei Lei
Jie Gu
Xiaokang Ma
Chu Tang
Jingmin Chen
Tong Xu
43
1
0
01 Jun 2025
Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs
Yunqi Hong
Sohyun An
Andrew Bai
Neil Y. C. Lin
Cho-Jui Hsieh
VLM
33
0
0
01 Jun 2025
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
Yunzhu Zhang
Yu Lu
T. Wang
Fengyun Rao
Yi Yang
Linchao Zhu
VLM
44
0
0
01 Jun 2025
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
Yuting Zhang
Hao Lu
Qingyong Hu
Yin Wang
Kaishen Yuan
Xin Liu
Kaishun Wu
MLLM
LRM
38
0
0
30 May 2025