Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

2 January 2024

Papers citing "Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation"

Title
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee | SSL, VLM | 192 | 3,659 | 0 | 06 Aug 2019
Zero-Shot Learning -- The Good, the Bad and the Ugly | Yongqin Xian, Bernt Schiele, Zeynep Akata | - | 55 | 834 | 0 | 13 Mar 2017
DenseCap: Fully Convolutional Localization Networks for Dense Captioning | Justin Johnson, A. Karpathy, Li Fei-Fei | VLM | 106 | 1,165 | 0 | 24 Nov 2015
CIDEr: Consensus-based Image Description Evaluation | Ramakrishna Vedantam, C. L. Zitnick, Devi Parikh | - | 217 | 4,451 | 0 | 20 Nov 2014