2301.12597
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 833 papers shown
Automatic Controllable Colorization via Imagination
Xiaoyan Cong
Yue Wu
Qifeng Chen
Chenyang Lei
DiffM
29
5
0
08 Apr 2024
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan
Jinfa Huang
Yujun Shi
Yongqi Xu
Ruijie Zhu
Bin Lin
Xinhua Cheng
Li-xin Yuan
Jiebo Luo
VGen
81
33
0
07 Apr 2024
Koala: Key frame-conditioned long video-LLM
Reuben Tan
Ximeng Sun
Ping Hu
Jui-hsien Wang
Hanieh Deilamsalehy
Bryan A. Plummer
Bryan C. Russell
Kate Saenko
38
36
0
05 Apr 2024
Physical Property Understanding from Language-Embedded Feature Fields
Albert J. Zhai
Yuan Shen
Emily Y. Chen
Gloria X. Wang
Xinlei Wang
Sheng Wang
Kaiyu Guan
Shenlong Wang
38
13
0
05 Apr 2024
VLRM: Vision-Language Models act as Reward Models for Image Captioning
Maksim Dzabraev
Alexander Kunitsyn
Andrei Ivaniuta
VLM
MLLM
31
3
0
02 Apr 2024
Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models
Kyuyoung Kim
Jongheon Jeong
Minyong An
Mohammad Ghavamzadeh
Krishnamurthy Dvijotham
Jinwoo Shin
Kimin Lee
EGVM
42
6
0
02 Apr 2024
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
Paritosh Parmar
Eric Peh
Ruirui Chen
Ting En Lam
Yuhan Chen
Elston Tan
Basura Fernando
CML
40
7
0
01 Apr 2024
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Ruohong Zhang
Liangke Gui
Zhiqing Sun
Yihao Feng
Keyang Xu
...
Di Fu
Chunyuan Li
Alexander G. Hauptmann
Yonatan Bisk
Yiming Yang
MLLM
56
60
0
01 Apr 2024
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
Barbara Toniella Corradini
Mustafa Shukor
Paul Couairon
Guillaume Couairon
Franco Scarselli
Matthieu Cord
DiffM
VLM
45
4
0
29 Mar 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Peng Gao
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Hongsheng Li
VLM
71
33
0
29 Mar 2024
Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Yutong He
Alexander Robey
Naoki Murata
Yiding Jiang
J. Williams
George Pappas
Hamed Hassani
Yuki Mitsufuji
Ruslan Salakhutdinov
J. Zico Kolter
DiffM
104
4
0
28 Mar 2024
RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents
Zeren Chen
Zhelun Shi
Xiaoya Lu
Lehan He
Sucheng Qian
...
Zhen-fei Yin
Jing Shao
Cewu Lu
38
5
0
28 Mar 2024
Garment3DGen: 3D Garment Stylization and Texture Generation
N. Sarafianos
Tuur Stuyck
Xiaoyu Xiang
Yilei Li
Jovan Popovic
Rakesh Ranjan
3DH
110
17
0
27 Mar 2024
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
Zhuowan Li
Bhavan A. Jasani
Peng Tang
Shabnam Ghadar
LRM
39
8
0
25 Mar 2024
Finding needles in a haystack: A Black-Box Approach to Invisible Watermark Detection
Minzhou Pan
Zhengting Wang
Xin Dong
Vikash Sehwag
Lingjuan Lyu
Xue Lin
40
3
0
23 Mar 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang
Long Bai
Wan Jun Nah
Jie Wang
Zhaoxi Zhang
Zhen Chen
Jinlin Wu
Mobarakol Islam
Hongbin Liu
Hongliang Ren
46
14
0
22 Mar 2024
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
Han Zhao
Min Zhang
Wei Zhao
Pengxiang Ding
Siteng Huang
Donglin Wang
Mamba
52
66
0
21 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
56
7
0
21 Mar 2024
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Jingkun An
Yinghao Zhu
Zongjian Li
Haoran Feng
Bohua Chen
Yemin Shi
Chengwei Pan
43
2
0
20 Mar 2024
Contextual AD Narration with Interleaved Multimodal Sequence
Hanlin Wang
Zhan Tong
Kecheng Zheng
Yujun Shen
Limin Wang
VGen
57
4
0
19 Mar 2024
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Ruyi Xu
Yuan Yao
Zonghao Guo
Junbo Cui
Zanlin Ni
Chunjiang Ge
Tat-Seng Chua
Zhiyuan Liu
Maosong Sun
Gao Huang
VLM
MLLM
37
104
0
18 Mar 2024
Prioritized Semantic Learning for Zero-shot Instance Navigation
Xander Sun
Louis Lau
Hoyard Zhi
Ronghe Qiu
Junwei Liang
40
8
0
18 Mar 2024
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
Ruicheng Wang
Jianfeng Xiang
Jiaolong Yang
Xin Tong
DiffM
40
4
0
18 Mar 2024
Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study
Chenguang Wang
Ruoxi Jia
Xin Liu
Dawn Song
VLM
29
7
0
15 Mar 2024
Autonomous Monitoring of Pharmaceutical R&D Laboratories with 6 Axis Arm Equipped Quadruped Robot and Generative AI: A Preliminary Study
Shunichi Hato
Nozomi Ogawa
31
1
0
15 Mar 2024
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
Enguang Wang
Zhimao Peng
Zhengyuan Xie
Fei Yang
Xialei Liu
Ming-Ming Cheng
62
3
0
15 Mar 2024
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Haiwen Huang
Songyou Peng
Dan Zhang
Andreas Geiger
VLM
37
3
0
14 Mar 2024
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
Sipeng Zheng
Bohan Zhou
Yicheng Feng
Ye Wang
Zongqing Lu
VLM
MLLM
46
7
0
14 Mar 2024
PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning
Qifeng Zhou
Wenliang Zhong
Yuzhi Guo
Michael Xiao
Hehuan Ma
Junzhou Huang
49
10
0
13 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL
MoMe
39
10
0
13 Mar 2024
TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation
Dingbang Li
Wenzhou Chen
Xin Lin
LLMAG
LM&Ro
47
4
0
13 Mar 2024
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation
Minbin Huang
Yanxin Long
Xinchi Deng
Ruihang Chu
Jiangfeng Xiong
Xiaodan Liang
Hong Cheng
Qinglin Lu
Wei Liu
MLLM
EGVM
65
8
0
13 Mar 2024
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
Lei Zhu
Fangyun Wei
Yanye Lu
MLLM
VLM
52
17
0
12 Mar 2024
NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning
Bingqian Lin
Yunshuang Nie
Ziming Wei
Jiaqi Chen
Shikui Ma
Jianhua Han
Hang Xu
Xiaojun Chang
Xiaodan Liang
LM&Ro
LRM
62
20
0
12 Mar 2024
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
Tianhao Qi
Shancheng Fang
Yanze Wu
Hongtao Xie
Jiawei Liu
Lang Chen
Qian He
Yongdong Zhang
DiffM
25
32
0
11 Mar 2024
VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model
Junsu Kim
Yunhoe Ku
Jihyeon Kim
Junuk Cha
Seungryul Baek
ObjD
VLM
37
12
0
08 Mar 2024
Debiasing Multimodal Large Language Models
Yi-Fan Zhang
Weichen Yu
Qingsong Wen
Xue Wang
Zhang Zhang
Liang Wang
Rong Jin
Tien-Ping Tan
53
4
0
08 Mar 2024
Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models
Qiuhui Chen
Huping Ye
Yi Hong
MedIm
46
1
0
08 Mar 2024
Evaluating Text-to-Image Generative Models: An Empirical Study on Human Image Synthesis
Mu-Hwa Chen
Yi Liu
Jian Yi
Changran Xu
Qiuxia Lai
Hongliang Wang
Tsung-Yi Ho
Qiang Xu
EGVM
40
7
0
08 Mar 2024
XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution
Yunpeng Qu
Kun Yuan
Kai Zhao
Qizhi Xie
Jinhua Hao
Ming Sun
Chao Zhou
27
17
0
08 Mar 2024
Embodied Understanding of Driving Scenarios
Yunsong Zhou
Linyan Huang
Qingwen Bu
Jia Zeng
Tianyu Li
Hang Qiu
Hongzi Zhu
Minyi Guo
Yu Qiao
Hongyang Li
LM&Ro
62
31
0
07 Mar 2024
Large Language Models are In-Context Molecule Learners
Jiatong Li
Wei Liu
Zhihao Ding
Wenqi Fan
Yuqiang Li
Qing Li
48
5
0
07 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
39
14
0
06 Mar 2024
Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation
Maksim Kuprashevich
Grigorii Alekseenko
Irina Tolstykh
ELM
56
4
0
04 Mar 2024
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
Lukas Höllein
Aljaž Božič
Norman Müller
David Novotny
Hung-Yu Tseng
Christian Richardt
Michael Zollhöfer
Matthias Nießner
DiffM
49
39
0
04 Mar 2024
Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency
Akila Wickramasekara
F. Breitinger
Mark Scanlon
52
8
0
29 Feb 2024
Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model
Hao-Ran Cheng
Erjia Xiao
Jindong Gu
Le Yang
Jinhao Duan
Jize Zhang
Jiahang Cao
Kaidi Xu
Renjing Xu
37
6
0
29 Feb 2024
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
Yulong Liu
Yunlong Yuan
Chunwei Wang
Jianhua Han
Yongqiang Ma
Li Zhang
Nanning Zheng
Hang Xu
LLMAG
45
5
0
28 Feb 2024
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada
Kanta Kaneda
Daichi Saito
Komei Sugiura
34
24
0
28 Feb 2024
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
Xiujie Song
Mengyue Wu
Ke Zhu
Chunhao Zhang
Yanyi Chen
LRM
ELM
36
3
0
28 Feb 2024