arXiv:2401.06167
Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation
2 January 2024
Change Che
Qunwei Lin
Xinyu Zhao
Jiaxin Huang
Liqiang Yu
VLM
Papers citing "Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation"
4 / 4 papers shown
Title
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
192
3,659
0
06 Aug 2019
Zero-Shot Learning -- The Good, the Bad and the Ugly
Yongqin Xian
Bernt Schiele
Zeynep Akata
55
834
0
13 Mar 2017
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
Justin Johnson
A. Karpathy
Li Fei-Fei
VLM
106
1,165
0
24 Nov 2015
CIDEr: Consensus-based Image Description Evaluation
Ramakrishna Vedantam
C. L. Zitnick
Devi Parikh
217
4,451
0
20 Nov 2014