2301.12597
Cited By
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,352 papers shown
Title
TrojVLM: Backdoor Attack Against Vision Language Models
Weimin Lyu
Lu Pang
Tengfei Ma
Haibin Ling
Chao Chen
MLLM
97
11
0
28 Sep 2024
Conditional Image Synthesis with Diffusion Models: A Survey
Zheyuan Zhan
Defang Chen
Jian-Ping Mei
Zhenghe Zhao
Jiawei Chen
Chun-Yen Chen
Siwei Lyu
Can Wang
VLM
109
10
0
28 Sep 2024
Emu3: Next-Token Prediction is All You Need
Xinlong Wang
Xiaosong Zhang
Zhengxiong Luo
Quan-Sen Sun
Yufeng Cui
...
Xi Yang
Jingjing Liu
Yonghua Lin
Tiejun Huang
Zhongyuan Wang
MLLM
116
233
0
27 Sep 2024
Trustworthy Text-to-Image Diffusion Models: A Timely and Focused Survey
Yi Zhang
Zhen Chen
Chih-Hong Cheng
Wenjie Ruan
Xiaowei Huang
Dezong Zhao
David Flynn
Siddartha Khastgir
Xingyu Zhao
MedIm
97
4
0
26 Sep 2024
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Ye Liu
Zongyang Ma
Zhongang Qi
Yang Wu
Ying Shan
Chang Wen Chen
112
23
0
26 Sep 2024
SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation
Xin Li
Siyuan Huang
Qiaojun Yu
Zhengkai Jiang
Ce Hao
Yimeng Zhu
Hongsheng Li
Peng Gao
Cewu Lu
77
0
0
26 Sep 2024
Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation
Qihan Huang
Siming Fu
Jinlong Liu
Hao Jiang
Yipeng Yu
Jie Song
74
9
0
26 Sep 2024
Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications
Nghia Nguyen
Minh Nhat Vu
Tung D. Ta
Baoru Huang
T. Vo
Ngan Le
Anh Nguyen
VLM
CLIP
79
6
0
26 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
175
12
0
26 Sep 2024
Neural Contrast: Leveraging Generative Editing for Graphic Design Recommendations
Marian Lupascu
Ionut Mironica
Mihai-Sorin Stupariu
DiffM
66
0
0
26 Sep 2024
ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue
Zhangpu Li
Changhong Zou
Suxue Ma
Zhicheng Yang
Chen Du
...
Xingzhi Sun
Jing Xiao
Kai Zhang
Mei Han
LM&MA
98
1
0
26 Sep 2024
Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization
Kento Kawaharazuka
Yoshiki Obinata
Naoaki Kanazawa
Kei Okada
Masayuki Inaba
LM&Ro
61
0
0
26 Sep 2024
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
Xun Zhu
Ying Hu
Fanbin Mo
Miao Li
Ji Wu
125
9
0
26 Sep 2024
SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning
Rimvydas Rubavicius
Peter David Fagan
A. Lascarides
Subramanian Ramamoorthy
LM&Ro
456
0
0
26 Sep 2024
Multi-View and Multi-Scale Alignment for Contrastive Language-Image Pre-training in Mammography
Yuexi Du
John Onofrey
Nicha Dvornek
VLM
110
2
0
26 Sep 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
248
52
0
26 Sep 2024
ChatCam: Empowering Camera Control through Conversational AI
Xinhang Liu
Yu-Wing Tai
Chi-Keung Tang
VGen
81
3
0
25 Sep 2024
Blox-Net: Generative Design-for-Robot-Assembly Using VLM Supervision, Physics Simulation, and a Robot with Reset
Andrew Goldberg
Kavish Kondap
Tianshuang Qiu
Zehan Ma
Letian Fu
Justin Kerr
Huang Huang
Kaiyuan Chen
Kuan Fang
Ken Goldberg
79
4
0
25 Sep 2024
DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling
Kyuheon Jung
Yongdeuk Seo
Seongwoo Cho
Jaeyoung Kim
Hyun-seok Min
Sungchul Choi
33
1
0
25 Sep 2024
The Role of Language Models in Modern Healthcare: A Comprehensive Review
Amna Khalid
Ayma Khalid
Umar Khalid
LM&MA
68
0
0
25 Sep 2024
How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not
Francesco Verdini
Pierfrancesco Melucci
Stefano Perna
Francesco Cariaggi
Marco Gaido
...
Marek Kasztelnik
L. Bentivogli
Sébastien Bratières
P. Merialdo
Simone Scardapane
AuLLM
69
1
0
25 Sep 2024
Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation
Siyin Wang
Wenyi Yu
Yudong Yang
Changli Tang
Yixuan Li
...
Jun Zhang
Guangzhi Sun
Lu Lu
Yuxuan Wang
Chao Zhang
AuLLM
LM&MA
136
8
0
25 Sep 2024
EventHallusion: Diagnosing Event Hallucinations in Video LLMs
Jiacheng Zhang
Yang Jiao
Shaoxiang Chen
Jingjing Chen
Zhiyu Tan
Hao Li
Jingjing Chen
MLLM
151
23
0
25 Sep 2024
GeoBiked: A Dataset with Geometric Features and Automated Labeling Techniques to Enable Deep Generative Models in Engineering Design
Phillip Mueller
Sebastian Mueller
Lars Mikelsons
125
2
0
25 Sep 2024
A Unified Hallucination Mitigation Framework for Large Vision-Language Models
Yue Chang
Liqiang Jing
Xiaopeng Zhang
Yue Zhang
VLM
MLLM
112
4
0
24 Sep 2024
Expert-level vision-language foundation model for real-world radiology and comprehensive evaluation
Xiaohong Liu
Guoxing Yang
Yulin Luo
Jiaji Mao
Xiang Zhang
Ming Gao
Shanghang Zhang
Jun Shen
Guangyu Wang
VLM
LM&MA
MedIm
68
2
0
24 Sep 2024
Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs
Yang Yuhang
Peng Yizhou
Eng Siong Chng
Xionghu Zhong
AuLLM
AI4CE
53
0
0
24 Sep 2024
Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving
Lingyu Xiao
Jiang-Jiang Liu
Sen Yang
Xiaofan Li
Xiaoqing Ye
Wankou Yang
Jingdong Wang
132
0
0
24 Sep 2024
SYNERGAI: Perception Alignment for Human-Robot Collaboration
Yixin Chen
Guoxi Zhang
Yaowei Zhang
Hongming Xu
Peiyuan Zhi
Qing Li
Siyuan Huang
75
0
0
24 Sep 2024
Critic Loss for Image Classification
B. Rappazzo
Aaron Ferber
Carla P. Gomes
VLM
63
0
0
23 Sep 2024
Exploring Fine-grained Retail Product Discrimination with Zero-shot Object Classification Using Vision-Language Models
Anil Osman Tur
Alessandro Conti
Cigdem Beyan
Davide Boscaini
Roberto Larcher
S. Messelodi
Fabio Poiesi
Elisa Ricci
VLM
108
0
0
23 Sep 2024
Multi-modal Generative AI: Multi-modal LLMs, Diffusions and the Unification
X. Wang
Yuwei Zhou
Bin Huang
Hong Chen
Wenwu Zhu
DiffM
154
9
0
23 Sep 2024
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models
Sombit Dey
Jan-Nico Zaech
Nikolay Nikolov
Luc Van Gool
Danda Pani Paudel
MoMe
VLM
151
5
0
23 Sep 2024
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models
Mohammad Shahab Sepehri
Zalan Fabian
Maryam Soltanolkotabi
Mahdi Soltanolkotabi
MedIm
142
6
0
23 Sep 2024
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li
Ge Zhang
Yinghao Ma
Ruibin Yuan
Kang Zhu
...
Zhaoxiang Zhang
Zachary Liu
Emmanouil Benetos
Wenhao Huang
Chenghua Lin
LRM
184
19
0
23 Sep 2024
SOS: Segment Object System for Open-World Instance Segmentation With Object Priors
Christian Wilms
Tim Rolff
Maris Hillemann
Robert Johanson
Simone Frintrop
VLM
85
1
0
22 Sep 2024
What Are They Doing? Joint Audio-Speech Co-Reasoning
Yingzhi Wang
Pooneh Mousavi
Artem Ploujnikov
Mirco Ravanelli
AuLLM
99
2
0
22 Sep 2024
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu
Peitian Zhang
Zheng Liu
Minghao Qin
Yueze Wang
Tiejun Huang
Bo Zhao
VLM
141
59
0
22 Sep 2024
LLMs are One-Shot URL Classifiers and Explainers
Fariza Rashid
Nishavi Ranaweera
Ben Doyle
Suranga Seneviratne
LRM
87
3
0
22 Sep 2024
Dormant: Defending against Pose-driven Human Image Animation
Jiachen Zhou
Mingsi Wang
Tianlin Li
Guozhu Meng
Kai Chen
160
5
0
22 Sep 2024
BrainDreamer: Reasoning-Coherent and Controllable Image Generation from EEG Brain Signals via Language Guidance
Ling Wang
Chen Wu
Lin Wang
DiffM
66
0
0
21 Sep 2024
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity
Zhibin Lan
Liqiang Niu
Fandong Meng
Wenbo Li
Jie Zhou
Jinsong Su
VLM
60
3
0
20 Sep 2024
FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs
Bowen Yan
Zhengsong Zhang
Liqiang Jing
Eftekhar Hossain
Xinya Du
118
3
0
20 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe
VLM
175
2
0
19 Sep 2024
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
Kosuke Sakurai
Tatsuya Ishii
Ryotaro Shimizu
Linxin Song
Masayuki Goto
VLM
76
0
0
19 Sep 2024
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation
Zhengguang Zhou
Jing Li
Huaxia Li
Nemo Chen
Xu Tang
DiffM
VGen
82
11
0
19 Sep 2024
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Xiaotian Han
Yiren Jian
Xuefeng Hu
Haogeng Liu
Yiqi Wang
...
Yuang Ai
Huaibo Huang
Ran He
Zhenheng Yang
Quanzeng You
LRM
AI4CE
59
22
0
19 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
152
2
0
19 Sep 2024
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
Junjie Wen
Yinlin Zhu
Jinming Li
Minjie Zhu
Kun Wu
...
Ran Cheng
Yaxin Peng
Chaomin Shen
Feifei Feng
Jian Tang
LM&Ro
182
70
0
19 Sep 2024
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
Yongqi Wang
Xinxiao Wu
Shuo Yang
Jiebo Luo
458
1
0
19 Sep 2024