2301.12597
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Papers citing "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
Salient Object-Aware Background Generation using Text-Guided Diffusion Models
Amir Erfan Eshratifar
João-Bruno Soares
K. Thadani
Shaunak Mishra
Mikhail Kuznetsov
Yueh-Ning Ku
P. De Juan
DiffM
117
4
0
15 Apr 2024
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
Han Lin
Jaemin Cho
Abhaysinh Zala
Mohit Bansal
DiffM
VGen
159
28
0
15 Apr 2024
Evolving Interpretable Visual Classifiers with Large Language Models
Mia Chiquier
Utkarsh Mall
Carl Vondrick
VLM
99
11
0
15 Apr 2024
MMInA: Benchmarking Multihop Multimodal Internet Agents
Ziniu Zhang
Shulin Tian
Liangyu Chen
Ziwei Liu
LLMAG
LM&Ro
72
22
0
15 Apr 2024
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao
Renjie Pi
Jianhua Han
Xiaodan Liang
Hang Xu
Wei Zhang
Zhenguo Li
Dan Xu
VLM
ObjD
96
26
0
14 Apr 2024
On Speculative Decoding for Multimodal Large Language Models
Mukul Gagrani
Raghavv Goel
Wonseok Jeon
Junyoung Park
Mingu Lee
Christopher Lott
LRM
68
11
0
13 Apr 2024
Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation
Brinnae Bent
67
3
0
12 Apr 2024
LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning
Junchi Wang
Lei Ke
MLLM
LRM
VLM
81
29
0
12 Apr 2024
Training a Vision Language Model as Smartphone Assistant
Nicolai Dorka
Janusz Marecki
Ammar Anwar
68
5
0
12 Apr 2024
COCONut: Modernizing COCO Segmentation
XueQing Deng
Qihang Yu
Peng Wang
Xiaohui Shen
Liang-Chieh Chen
84
17
0
12 Apr 2024
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
Övgü Özdemir
Erdem Akagündüz
104
11
0
12 Apr 2024
LaSagnA: Language-based Segmentation Assistant for Complex Queries
Cong Wei
Haoxian Tan
Yujie Zhong
Yujiu Yang
Lin Ma
113
17
0
12 Apr 2024
Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu
Tongkai Shi
Liqing Gao
Zekang Liu
Wei Feng
VLM
86
5
0
12 Apr 2024
Connecting NeRFs, Images, and Text
Francesco Ballerini
Pierluigi Zama Ramirez
Roberto Mirabella
Samuele Salti
Luigi Di Stefano
112
5
0
11 Apr 2024
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
Moreno D'Incà
E. Peruzzo
Massimiliano Mancini
Dejia Xu
Vidit Goel
Xingqian Xu
Zhangyang Wang
Humphrey Shi
N. Sebe
117
37
0
11 Apr 2024
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Haotian Zhang
Haoxuan You
Philipp Dufter
Bowen Zhang
Chen Chen
...
Tsu-Jui Fu
William Y. Wang
Shih-Fu Chang
Zhe Gan
Yinfei Yang
ObjD
MLLM
157
51
0
11 Apr 2024
Taming Stable Diffusion for Text to 360° Panorama Image Generation
Cheng Zhang
Qianyi Wu
Camilo Cruz Gambardella
Xiaoshui Huang
Dinh Q. Phung
Wanli Ouyang
Jianfei Cai
MDE
81
9
0
11 Apr 2024
Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese
Yuichi Inoue
Kento Sasaki
Yuma Ochi
Kazuki Fujii
Kotaro Tanahashi
Yu Yamaguchi
VLM
59
5
0
11 Apr 2024
Implicit and Explicit Language Guidance for Diffusion-based Visual Perception
Hefeng Wang
Jiale Cao
Jin Xie
Aiping Yang
Yanwei Pang
VLM
DiffM
110
2
0
11 Apr 2024
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe
Satya Narayan Shukla
Omid Poursaeed
Michael S. Ryoo
Tsung-Yu Lin
LRM
77
31
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
80
32
0
10 Apr 2024
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic
Sachin Goyal
Pratyush Maini
Zachary Chase Lipton
Aditi Raghunathan
J. Zico Kolter
105
46
0
10 Apr 2024
Unified Language-driven Zero-shot Domain Adaptation
Senqiao Yang
Zhuotao Tian
Li Jiang
Jiaya Jia
95
10
0
10 Apr 2024
VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning
Alexandros Xenos
Niki Maria Foteinopoulou
Ioanna Ntinou
Ioannis Patras
Georgios Tzimiropoulos
88
16
0
10 Apr 2024
Identification of Fine-grained Systematic Errors via Controlled Scene Generation
Valentyn Boreiko
Matthias Hein
J. H. Metzen
83
1
0
10 Apr 2024
HRVDA: High-Resolution Visual Document Assistant
Chaohu Liu
Kun Yin
Haoyu Cao
Xinghua Jiang
Xin Li
Yinsong Liu
Deqiang Jiang
Xing Sun
Linli Xu
VLM
102
26
0
10 Apr 2024
UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion
Junsheng Zhou
Weiqi Zhang
Baorui Ma
Kanle Shi
Yu-Shen Liu
Zhizhong Han
121
19
0
10 Apr 2024
GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation
Mukul Khanna
Ram Ramrakhya
Gunjan Chhablani
Sriram Yenamandra
Théophile Gervet
Matthew Chang
Z. Kira
Devendra Singh Chaplot
Dhruv Batra
Roozbeh Mottaghi
LM&Ro
126
34
0
09 Apr 2024
Anchor-based Robust Finetuning of Vision-Language Models
Jinwei Han
Zhiwen Lin
Zhongyi Sun
Yingguo Gao
Ke Yan
Shouhong Ding
Yuan Gao
Gui-Song Xia
VLM
117
6
0
09 Apr 2024
OmniFusion Technical Report
Elizaveta Goncharova
Anton Razzhigaev
Matvey Mikhalchuk
Maxim Kurkin
Irina Abdullaeva
Matvey Skripkin
Ivan Oseledets
Denis Dimitrov
Andrey Kuznetsov
74
4
0
09 Apr 2024
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation
Junkai Yan
Yipeng Gao
Q. Yang
Xihan Wei
Xuansong Xie
Ancong Wu
Wei-Shi Zheng
78
2
0
09 Apr 2024
Unified Multi-modal Diagnostic Framework with Reconstruction Pre-training and Heterogeneity-combat Tuning
Yupei Zhang
Li Pan
Qiushi Yang
Tan Li
Zhen Chen
91
1
0
09 Apr 2024
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
Junpeng Liu
Yifan Song
Bill Yuchen Lin
Wai Lam
Graham Neubig
Yuanzhi Li
Xiang Yue
VLM
132
49
0
09 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
95
31
0
09 Apr 2024
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
Bo Peng
Daniel Goldstein
Quentin G. Anthony
Alon Albalak
Eric Alcaide
...
Bingchen Zhao
Qihang Zhao
Peng Zhou
Jian Zhu
Ruijie Zhu
119
82
0
08 Apr 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He
Hengduo Li
Young Kyun Jang
Menglin Jia
Xuefei Cao
Ashish Shah
Abhinav Shrivastava
Ser-Nam Lim
MLLM
135
101
0
08 Apr 2024
CoReS: Orchestrating the Dance of Reasoning and Segmentation
Xiaoyi Bao
Siyang Sun
Shuailei Ma
Kecheng Zheng
Yuxin Guo
Guosheng Zhao
Yun Zheng
Xingang Wang
LRM
120
10
0
08 Apr 2024
Automatic Controllable Colorization via Imagination
Xiaoyan Cong
Yue Wu
Qifeng Chen
Chenyang Lei
DiffM
62
5
0
08 Apr 2024
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
Matteo Farina
Massimiliano Mancini
Elia Cunegatti
Gaowen Liu
Giovanni Iacca
Elisa Ricci
VLM
79
2
0
08 Apr 2024
Test-Time Zero-Shot Temporal Action Localization
Benedetta Liberatori
Alessandro Conti
Paolo Rota
Yiming Wang
Elisa Ricci
141
5
0
08 Apr 2024
Progressive Alignment with VLM-LLM Feature to Augment Defect Classification for the ASE Dataset
Chih-Chung Hsu
Chia-Ming Lee
Chun-Hung Sun
Kuang-Ming Wu
39
0
0
08 Apr 2024
Facial Affective Behavior Analysis with Instruction Tuning
Yifan Li
Anh Dao
Wentao Bao
Zhen Tan
Tianlong Chen
Huan Liu
Yu Kong
CVBM
116
15
0
07 Apr 2024
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong
Yanbei Chen
Jiarui Cai
Davide Modolo
VLM
ObjD
67
7
0
07 Apr 2024
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan
Jinfa Huang
Yujun Shi
Yongqi Xu
Ruijie Zhu
Bin Lin
Xinhua Cheng
Li-xin Yuan
Jiebo Luo
VGen
169
36
0
07 Apr 2024
Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model
Zhonghan Zhao
Ke Ma
Wenhao Chai
Xuan Wang
Kewei Chen
Dongxu Guo
Yanting Zhang
Hongwei Wang
Gaoang Wang
81
20
0
06 Apr 2024
Diffusion Time-step Curriculum for One Image to 3D Generation
Xuanyu Yi
Zike Wu
Qingshan Xu
Pan Zhou
Joo-Hwee Lim
Hanwang Zhang
126
20
0
06 Apr 2024
Mixed-Query Transformer: A Unified Image Segmentation Architecture
Pei Wang
Zhaowei Cai
Hao Yang
Ashwin Swaminathan
R. Manmatha
Stefano Soatto
121
2
0
06 Apr 2024
Koala: Key frame-conditioned long video-LLM
Reuben Tan
Ximeng Sun
Ping Hu
Jui-hsien Wang
Hanieh Deilamsalehy
Bryan A. Plummer
Bryan C. Russell
Kate Saenko
111
41
0
05 Apr 2024
Physical Property Understanding from Language-Embedded Feature Fields
Albert J. Zhai
Yuan Shen
Emily Y. Chen
Gloria X. Wang
Xinlei Wang
Sheng Wang
Kaiyu Guan
Shenlong Wang
78
14
0
05 Apr 2024
PARIS3D: Reasoning-based 3D Part Segmentation Using Large Multimodal Model
Amrin Kareem
Jean Lahoud
Hisham Cholakkal
LRM
92
4
0
04 Apr 2024