Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
Ryota Tanaka
Taichi Iki
Taku Hasegawa
Kyosuke Nishida
Kuniko Saito
Jun Suzuki
VLM
118
6
0
14 Apr 2025
SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging
Tan-Hanh Pham
Chris Ngo
Trong-Duong Bui
Minh Luu Quang
Tan-Huong Pham
Truong-Son Hy
122
2
0
14 Apr 2025
Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis
Kaiwen Zheng
Xuri Ge
Junchen Fu
Jun Peng
J. Jose
CVBM
68
0
0
14 Apr 2025
UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval
Yating Liu
Yaowei Li
Xiangyuan Lan
Wenming Yang
Zimo Liu
Q. Liao
86
0
0
14 Apr 2025
Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge
Maria Tzelepi
Vasileios Mezaris
110
1
0
14 Apr 2025
Art3D: Training-Free 3D Generation from Flat-Colored Illustration
Xiaoyan Cong
Jiayi Shen
Zekun Li
Rao Fu
Tao Lu
Srinath Sridhar
3DH
88
0
0
14 Apr 2025
GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model
Wenji Fang
Jing Wang
Yao Lu
Shang Liu
Zhiyao Xie
AI4CE
89
1
0
13 Apr 2025
DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning
Yining Zhao
Ali Braytee
Mukesh Prasad
VLM
MedIm
42
0
0
13 Apr 2025
SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model
Kaiyu Li
Zepeng Xin
Li Pang
Chao Pang
Yupeng Deng
Jing Yao
Guisong Xia
Deyu Meng
Zhi Wang
Xiangyong Cao
VLM
LRM
101
4
0
13 Apr 2025
Evolved Hierarchical Masking for Self-Supervised Learning
Zhanzhou Feng
Shiliang Zhang
141
0
0
12 Apr 2025
Using Vision Language Models for Safety Hazard Identification in Construction
Muhammad Adil
Gaang Lee
Vicente A. Gonzalez
Qipei Mei
100
1
0
12 Apr 2025
SDIGLM: Leveraging Large Language Models and Multi-Modal Chain of Thought for Structural Damage Identification
Yanzhe Zhang
Shiyin Wei
Yong Huang
Yawu Su
Shanshan Lu
Hui Li
AI4CE
70
0
0
12 Apr 2025
Visual moral inference and communication
Warren Zhu
Aida Ramezani
Yang Xu
68
0
0
12 Apr 2025
CL-CoTNav: Closed-Loop Hierarchical Chain-of-Thought for Zero-Shot Object-Goal Navigation with Vision-Language Models
Yuxin Cai
Xiangkun He
Maonan Wang
Hongliang Guo
W. Yau
Chen Lv
LM&Ro
LRM
113
1
0
11 Apr 2025
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Tianwei Xiong
Jun Hao Liew
Zilong Huang
Jiashi Feng
Xihui Liu
89
1
0
11 Apr 2025
On The Landscape of Spoken Language Models: A Comprehensive Survey
Siddhant Arora
Kai-Wei Chang
Chung-Ming Chien
Yifan Peng
Haibin Wu
Yossi Adi
Emmanuel Dupoux
Hung-yi Lee
Karen Livescu
Shinji Watanabe
155
14
0
11 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
472
0
0
11 Apr 2025
Spatial Audio Processing with Large Language Model on Wearable Devices
Ayushi Mishra
Yang Bai
Priyadarshan Narayanasamy
Nakul Garg
Nirupam Roy
107
1
0
11 Apr 2025
Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
Boyang Deng
Songyou Peng
Kyle Genova
Gordon Wetzstein
Noah Snavely
Leonidas Guibas
Thomas Funkhouser
HAI
458
0
0
11 Apr 2025
PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models
M. Dhouib
Davide Buscaldi
Sonia Vanier
A. Shabou
VLM
107
1
0
11 Apr 2025
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena
Tommaso Apicella
Stefano Rosa
Pietro Morerio
Alessio Del Bue
Lorenzo Natale
88
0
0
11 Apr 2025
FlexIP: Dynamic Control of Preservation and Personality for Customized Image Generation
Linyan Huang
Haonan Lin
Yanning Zhou
Kaiwen Xiao
105
1
0
10 Apr 2025
Data Metabolism: An Efficient Data Design Schema For Vision Language Model
Jingyuan Zhang
Hongzhi Zhang
Zhou Haonan
Chenxi Sun
Xingguang Ji
Jiakang Wang
Fanheng Kong
Yang Liu
Qi Wang
Fuzheng Zhang
VLM
145
2
0
10 Apr 2025
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs
Zijian Zhang
Xuhui Zheng
X. Wu
Chong Peng
Xuezhi Cao
71
2
0
10 Apr 2025
Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
Dibyadip Chatterjee
Edoardo Remelli
Yale Song
Bugra Tekin
Abhay Mittal
...
Shreyas Hampali
Eric Sauser
Shugao Ma
Angela Yao
Fadime Sener
VLM
101
0
0
10 Apr 2025
How Can Objects Help Video-Language Understanding?
Zitian Tang
Shijie Wang
Junho Cho
Jaewook Yoo
Chen Sun
119
1
0
10 Apr 2025
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding
Henghao Zhao
Ge-Peng Ji
Rui Yan
Huan Xiong
Zechao Li
76
1
0
10 Apr 2025
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Yangliu Hu
Zikai Song
Na Feng
Yawei Luo
Junqing Yu
Yi-Ping Phoebe Chen
Wei Yang
71
2
0
10 Apr 2025
Impact of Language Guidance: A Reproducibility Study
Cherish Puniani
Advika Sinha
Shree Singhi
Aayan Yadav
VLM
206
0
0
10 Apr 2025
Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition
Sergio Romero-Tapiador
Ruben Tolosana
Blanca Lacruz-Pleguezuelos
L. Marcos-Zambrano
Guadalupe X.Bazán
Isabel Espinosa-Salinas
Julian Fierrez
Javier-Ortega Garcia
Enrique Carrillo-de Santa Pau
Aythami Morales
CoGe
72
0
0
09 Apr 2025
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Ziyi Wang
Haoran Wu
Yiming Rong
Deyang Jiang
Yixin Zhang
Yue Zhao
Shuang Xu
Bo Xu
VLM
99
0
0
09 Apr 2025
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Liang-Hsuan Tseng
Yi-Chang Chen
Kuan-Yi Lee
Da-shan Shiu
Hung-yi Lee
AuLLM
161
0
0
09 Apr 2025
Perception in Reflection
Yana Wei
Liang Zhao
Kangheng Lin
En Yu
Yuang Peng
...
Jianjian Sun
Haoran Wei
Zheng Ge
Xiangyu Zhang
Vishal M. Patel
129
1
0
09 Apr 2025
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception
Ruotian Peng
Haiying He
Yake Wei
Yandong Wen
D. Hu
VLM
72
0
0
09 Apr 2025
Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition
Tom Simon
William Mocaer
Pierrick Tranouez
Clément Chatelain
Thierry Paquet
MLLM
VLM
92
0
0
09 Apr 2025
Resource-efficient Inference with Foundation Model Programs
Lunyiu Nie
Zhimin Ding
Kevin Yu
Marco Cheung
C. Jermaine
S. Chaudhuri
73
0
0
09 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
Kai Zhang
Jinahua Han
Lanqing Hong
Hang Xu
Xuelong Li
MLLM
VLM
502
0
0
08 Apr 2025
SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
Hritam Basak
Zhaozheng Yin
VLM
78
0
0
08 Apr 2025
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
Hao Du
Bo Wu
Yan Lu
Zhendong Mao
83
0
0
08 Apr 2025
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Yiying Yang
Wei Cheng
Sijin Chen
Xianfang Zeng
Jiaxu Zhang
Liao Wang
Gang Yu
Xingjun Ma
Xingjun Ma
Yu Jiang
VLM
127
6
0
08 Apr 2025
BRIDGES: Bridging Graph Modality and Large Language Models within EDA Tasks
Wei Li
Yang Zou
Christopher Ellis
Ruben Purdy
Shawn Blanton
José M. F. Moura
50
0
0
07 Apr 2025
Video-Bench: Human-Aligned Video Generation Benchmark
Hui Han
Siyuan Li
Jiaqi Chen
Yiwen Yuan
Yuling Wu
...
You Li
Jing Zhang
Chi Zhang
Li Li
Yongxin Ni
EGVM
VGen
208
0
0
07 Apr 2025
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
Sofian Chaybouti
Walid Bousselham
Moritz Wolter
Hilde Kuehne
394
0
0
07 Apr 2025
Taxonomy-Aware Evaluation of Vision-Language Models
Vésteinn Snæbjarnarson
Kevin Du
Niklas Stoehr
Serge Belongie
Ryan Cotterell
Nico Lang
Stella Frank
92
2
0
07 Apr 2025
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
Sai Kumar Dwivedi
Dimitrije Antić
Shashank Tripathi
Omid Taheri
Cordelia Schmid
M. Black
Dimitrios Tzionas
102
1
0
07 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
107
1
0
07 Apr 2025
URECA: Unique Region Caption Anything
Sangbeom Lim
J. Kim
Heeji Yoon
Jaewoo Jung
Seungryong Kim
115
1
0
07 Apr 2025
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
...
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf
VLM
99
16
0
07 Apr 2025
REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding
Sakib Reza
Xiyun Song
Heather Yu
Zongfang Lin
Mohsen Moghaddam
Mario Sznaier
71
0
0
07 Apr 2025
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
He Zhu
Quyu Kong
Kechun Xu
Xunlong Xia
Bing Deng
Jieping Ye
R. Xiong
Yansen Wang
72
0
0
07 Apr 2025
Previous
1
2
3
...
7
8
9
...
45
46
47
Next