Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2301.12597
Cited By
v1
v2
v3 (latest)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
30 January 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
50 / 2,345 papers shown
Title
Automated Report Generation for Lung Cytological Images Using a CNN Vision Classifier and Multiple-Transformer Text Decoders: Preliminary Study
Atsushi Teramoto
Ayano Michiba
Yuka Kiriyama
Tetsuya Tsukamoto
K. Imaizumi
H. Fujita
MedIm
52
1
0
26 Mar 2024
OmniVid: A Generative Framework for Universal Video Understanding
Junke Wang
Dongdong Chen
Chong Luo
Bo He
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
VLM
VGen
119
16
0
26 Mar 2024
Assessment of Multimodal Large Language Models in Alignment with Human Values
Zhelun Shi
Zhipin Wang
Hongxing Fan
Zaibin Zhang
Lijun Li
Yongting Zhang
Zhen-fei Yin
Lu Sheng
Yu Qiao
Jing Shao
77
22
0
26 Mar 2024
AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving
Mingfu Liang
Jong-Chyi Su
S. Schulter
Sparsh Garg
Shiyu Zhao
Ying Nian Wu
Manmohan Chandraker
VLM
95
15
0
26 Mar 2024
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao
Shengju Qian
Han Xiao
Guanglu Song
Zhuofan Zong
Letian Wang
Yu Liu
Hongsheng Li
VGen
LRM
MLLM
112
77
0
25 Mar 2024
Composed Video Retrieval via Enriched Context and Discriminative Embeddings
Omkar Thawakar
Muzammal Naseer
Rao Muhammad Anwer
Salman Khan
Michael Felsberg
Mubarak Shah
Fahad Shahbaz Khan
56
11
0
25 Mar 2024
Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text
Junshu Tang
Yanhong Zeng
Ke Fan
Xuheng Wang
Bo Dai
Kai Chen
Lizhuang Ma
74
7
0
25 Mar 2024
DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization
Yunlong Tang
Yuxuan Wan
Lei Qi
Xin Geng
VLM
76
5
0
25 Mar 2024
Elysium: Exploring Object-level Perception in Videos via MLLM
Hang Wang
Yanjie Wang
Yongjie Ye
Yuxiang Nie
Can Huang
MLLM
88
23
0
25 Mar 2024
Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation
Yingshan Chang
Yasi Zhang
Zhiyuan Fang
Yingnian Wu
Yonatan Bisk
Feng Gao
EGVM
116
7
0
25 Mar 2024
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
Zhuowan Li
Bhavan A. Jasani
Peng Tang
Shabnam Ghadar
LRM
85
10
0
25 Mar 2024
Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
Minchan Kim
Minyeong Kim
Junik Bae
Suhwan Choi
Sungkyung Kim
Buru Chang
VLM
45
4
0
24 Mar 2024
Enhancing Video Transformers for Action Understanding with VLM-aided Training
Hui Lu
Hu Jian
Ronald Poppe
A. A. Salah
76
2
0
24 Mar 2024
Enhancing Visual Continual Learning with Language-Guided Supervision
Bolin Ni
Hongbo Zhao
Chenghao Zhang
Ke Hu
Gaofeng Meng
Zhaoxiang Zhang
Shiming Xiang
CLL
VLM
135
4
0
24 Mar 2024
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Yuchen Suo
Fan Ma
Linchao Zhu
Yi Yang
82
24
0
24 Mar 2024
Finding needles in a haystack: A Black-Box Approach to Invisible Watermark Detection
Minzhou Pan
Zhengting Wang
Xin Dong
Vikash Sehwag
Lingjuan Lyu
Xue Lin
82
3
0
23 Mar 2024
Explore until Confident: Efficient Exploration for Embodied Question Answering
Allen Z. Ren
Jaden Clark
Anushri Dixit
Masha Itkina
Anirudha Majumdar
Dorsa Sadigh
214
34
0
23 Mar 2024
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang
Mu Cai
Bingxin Xu
Yong Jae Lee
Yan Yan
VLM
134
127
0
22 Mar 2024
Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
Jimyeong Kim
Jungwon Park
Wonjong Rhee
DiffM
91
5
0
22 Mar 2024
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu
Weihao Ye
Yiyi Zhou
Xiaoshuai Sun
Rongrong Ji
MoE
84
1
0
22 Mar 2024
A Multimodal Approach for Cross-Domain Image Retrieval
Lucas Iijima
Tania Stathaki
66
1
0
22 Mar 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang
Long Bai
Wan Jun Nah
Jie Wang
Zhaoxi Zhang
Zhen Chen
Jinlin Wu
Mobarakol Islam
Hongbin Liu
Hongliang Ren
129
17
0
22 Mar 2024
WeatherProof: Leveraging Language Guidance for Semantic Segmentation in Adverse Weather
Blake Gella
Howard Zhang
Rishi Upadhyay
Tiffany Chang
Nathan Wei
Matthew Waliman
Yunhao Bao
C. Melo
Alex Wong
A. Kadambi
67
0
0
21 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLM
AI4TS
92
4
0
21 Mar 2024
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
Bowen Jiang
Zhijun Zhuang
Shreyas S. Shivakumar
Dan Roth
Camillo J Taylor
LLMAG
55
3
0
21 Mar 2024
Few-Shot Adversarial Prompt Learning on Vision-Language Models
Yiwei Zhou
Xiaobo Xia
Zhiwei Lin
Bo Han
Tongliang Liu
VLM
106
16
0
21 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith C.H. Ngai
116
6
0
21 Mar 2024
DreamReward: Text-to-3D Generation with Human Preference
Junliang Ye
Fangfu Liu
Qixiu Li
Zhengyi Wang
Yikai Wang
Xinzhou Wang
Yueqi Duan
Jun Zhu
107
29
0
21 Mar 2024
ReNoise: Real Image Inversion Through Iterative Noising
Daniel Garibi
Or Patashnik
Andrey Voynov
Hadar Averbuch-Elor
Daniel Cohen-Or
DiffM
111
57
0
21 Mar 2024
MyVLM: Personalizing VLMs for User-Specific Queries
Yuval Alaluf
Elad Richardson
Sergey Tulyakov
Kfir Aberman
Daniel Cohen-Or
MLLM
VLM
107
23
0
21 Mar 2024
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
Zheng Zhang
Yeyao Ma
Enming Zhang
Xiang Bai
VLM
MLLM
127
47
0
21 Mar 2024
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
Han Zhao
Min Zhang
Wei Zhao
Pengxiang Ding
Siteng Huang
Donglin Wang
Mamba
119
74
0
21 Mar 2024
Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination
Dingchen Yang
Bowen Cao
Guang Chen
Changjun Jiang
87
11
0
21 Mar 2024
CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing
Ajian Liu
Shuai Xue
Jianwen Gan
Jun Wan
Yanyan Liang
Jiankang Deng
Sergio Escalera
Zhen Lei
VLM
73
27
0
21 Mar 2024
Empowering Segmentation Ability to Multi-modal Large Language Models
Yuqi Yang
Peng-Tao Jiang
Jing Wang
Hao Zhang
Kai Zhao
Jinwei Chen
Yue Liu
LRM
VLM
86
4
0
21 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
171
9
0
21 Mar 2024
VL-Mamba: Exploring State Space Models for Multimodal Learning
Yanyuan Qiao
Zheng Yu
Longteng Guo
Sihan Chen
Zijia Zhao
Mingzhen Sun
Qi Wu
Jing Liu
Mamba
114
72
0
20 Mar 2024
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Théophane Vallaeys
Mustafa Shukor
Matthieu Cord
Jakob Verbeek
103
13
0
20 Mar 2024
Inserting Faces inside Captions: Image Captioning with Attention Guided Merging
Yannis Tevissen
Khalil Guetari
Marine Tassel
Erwan Kerleroux
Frédéric Petitpont
65
0
0
20 Mar 2024
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Jingkun An
Yinghao Zhu
Zongjian Li
Haoran Feng
Bohua Chen
Yemin Shi
Chengwei Pan
67
2
0
20 Mar 2024
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models
Zuyan Liu
Yuhao Dong
Yongming Rao
Jie Zhou
Jiwen Lu
LRM
79
21
0
19 Mar 2024
GVGEN: Text-to-3D Generation with Volumetric Representation
Xianglong He
Junyi Chen
Sida Peng
Di Huang
Yangguang Li
Xiaoshui Huang
Chun Yuan
Wanli Ouyang
Tong He
3DGS
DiffM
138
30
0
19 Mar 2024
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu
Haiyang Xu
Jiabo Ye
Mingshi Yan
Liang Zhang
...
Chen Li
Ji Zhang
Qin Jin
Fei Huang
Jingren Zhou
VLM
117
125
0
19 Mar 2024
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Fucai Ke
Zhixi Cai
Simindokht Jahangard
Weiqing Wang
P. D. Haghighi
Hamid Rezatofighi
LRM
99
12
0
19 Mar 2024
VisualCritic: Making LMMs Perceive Visual Quality Like Humans
Zhipeng Huang
Zhizheng Zhang
Yiting Lu
Zheng-Jun Zha
Zhibo Chen
Baining Guo
MLLM
95
12
0
19 Mar 2024
RelationVLM: Making Large Vision-Language Models Understand Visual Relations
Zhipeng Huang
Zhizheng Zhang
Zheng-Jun Zha
Yan Lu
Baining Guo
VLM
56
3
0
19 Mar 2024
Towards Multimodal In-Context Learning for Vision & Language Models
Sivan Doveh
Shaked Perek
M. Jehanzeb Mirza
Wei Lin
Amit Alfassy
Assaf Arbelle
S. Ullman
Leonid Karlinsky
VLM
184
18
0
19 Mar 2024
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
Yuanhuiyi Lyu
Xueye Zheng
Jiazhou Zhou
Lin Wang
94
25
0
19 Mar 2024
Contextual AD Narration with Interleaved Multimodal Sequence
Hanlin Wang
Zhan Tong
Kecheng Zheng
Yujun Shen
Limin Wang
VGen
132
4
0
19 Mar 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
Enshen Zhou
Yiran Qin
Zhen-fei Yin
Yuzhou Huang
Ruimao Zhang
Lu Sheng
Yu Qiao
Jing Shao
LM&Ro
AI4CE
113
36
0
18 Mar 2024
Previous
1
2
3
...
41
42
43
...
45
46
47
Next