2111.09734
ClipCap: CLIP Prefix for Image Captioning
18 November 2021
Ron Mokady
Amir Hertz
Amit H. Bermano
CLIP
VLM
Papers citing
"ClipCap: CLIP Prefix for Image Captioning"
50 / 144 papers shown
Title
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Chaoyi Zhang
K. Lin
Zhengyuan Yang
Jianfeng Wang
Linjie Li
Chung-Ching Lin
Zicheng Liu
Lijuan Wang
VGen
29
28
0
29 Nov 2023
Image Clustering Conditioned on Text Criteria
Sehyun Kwon
Jaeseung Park
Minkyu Kim
Jaewoong Cho
Ernest K. Ryu
Kangwook Lee
VLM
42
11
0
27 Oct 2023
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Laura Cabello
Emanuele Bugliarello
Stephanie Brandl
Desmond Elliott
23
7
0
26 Oct 2023
CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
K. A. Noriy
Xiaosong Yang
Marcin Budka
Jian Jun Zhang
VLM
26
3
0
18 Oct 2023
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh
Ashish Seth
Sonal Kumar
Utkarsh Tyagi
Chandra Kiran Reddy Evuru
S. Ramaneswaran
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM
VLM
CoGe
43
23
0
12 Oct 2023
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
32
36
0
10 Oct 2023
Text Embeddings Reveal (Almost) As Much As Text
John X. Morris
Volodymyr Kuleshov
Vitaly Shmatikov
Alexander M. Rush
RALM
28
96
0
10 Oct 2023
Sentence-level Prompts Benefit Composed Image Retrieval
Yang Bai
Xinxing Xu
Yong-Jin Liu
Salman Khan
Fahad Khan
Wangmeng Zuo
Rick Siow Mong Goh
Chun-Mei Feng
41
26
0
09 Oct 2023
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Bipin Rajendran
Bashir M. Al-Hashimi
MLLM
VLM
32
2
0
27 Sep 2023
Jointly Training Large Autoregressive Multimodal Models
Emanuele Aiello
L. Yu
Yixin Nie
Armen Aghajanyan
Barlas Oğuz
24
29
0
27 Sep 2023
Tackling VQA with Pretrained Foundation Models without Further Training
Alvin De Jun Tan
Bingquan Shen
MLLM
37
1
0
27 Sep 2023
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
Chen Jiang
Hong Liu
Xuzheng Yu
Qing Wang
Yuan Cheng
...
Zhongyi Liu
Qingpei Guo
Wei Chu
Ming Yang
Yuan Qi
29
10
0
20 Sep 2023
DeViL: Decoding Vision features into Language
Meghal Dani
Isabel Rio-Torto
Stephan Alaniz
Zeynep Akata
VLM
42
7
0
04 Sep 2023
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
Bang-ju Yang
Fenglin Liu
X. Wu
Yaowei Wang
Xu Sun
Yuexian Zou
VLM
CLIP
44
13
0
25 Aug 2023
With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning
Manuele Barraco
Sara Sarto
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
VLM
55
19
0
23 Aug 2023
ViCo: Engaging Video Comment Generation with Human Preference Rewards
Yuchong Sun
Bei Liu
Xu Chen
Ruihua Song
Jianlong Fu
VGen
22
2
0
22 Aug 2023
Pro-Cap: Leveraging a Frozen Vision-Language Model for Hateful Meme Detection
Rui Cao
Ming Shan Hee
Adriel Kuek
Wen-Haw Chong
Roy Ka-Wei Lee
Jing Jiang
VLM
MLLM
27
36
0
16 Aug 2023
Reverse Stable Diffusion: What prompt was used to generate this image?
Florinel-Alin Croitoru
Vlad Hondru
Radu Tudor Ionescu
M. Shah
VLM
DiffM
42
6
0
02 Aug 2023
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
Junjie Fei
Teng Wang
Jinrui Zhang
Zhenyu He
Chengjie Wang
Feng Zheng
VLM
31
34
0
31 Jul 2023
Divert More Attention to Vision-Language Object Tracking
Mingzhe Guo
Zhipeng Zhang
Li Jing
Haibin Ling
Heng Fan
VLM
42
3
0
19 Jul 2023
Linear Alignment of Vision-language Models for Image Captioning
Fabian Paischer
M. Hofmarcher
Sepp Hochreiter
Thomas Adler
CLIP
VLM
50
0
0
10 Jul 2023
A Multimodal Prototypical Approach for Unsupervised Sound Classification
Saksham Singh Kushwaha
Magdalena Fuentes
39
8
0
21 Jun 2023
Extending CLIP's Image-Text Alignment to Referring Image Segmentation
Seoyeon Kim
Minguk Kang
Dongwon Kim
Jaesik Park
Suha Kwak
VLM
30
10
0
14 Jun 2023
Scalable Performance Analysis for Vision-Language Models
Santiago Castro
Oana Ignat
Rada Mihalcea
VLM
35
1
0
30 May 2023
Contextual Object Detection with Multimodal Large Language Models
Yuhang Zang
Wei Li
Jun Han
Kaiyang Zhou
Chen Change Loy
ObjD
VLM
MLLM
41
78
0
29 May 2023
Generalizable Synthetic Image Detection via Language-guided Contrastive Learning
Haiwei Wu
Jiantao Zhou
Shile Zhang
118
28
0
23 May 2023
DiffCap: Exploring Continuous Diffusion on Image Captioning
Yufeng He
Zefan Cai
Xu Gan
Baobao Chang
DiffM
34
5
0
20 May 2023
Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
MLLM
AuLLM
39
159
0
19 May 2023
OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding
Minghua Liu
Ruoxi Shi
Kaiming Kuang
Yinhao Zhu
Xuanlin Li
Shizhong Han
H. Cai
Fatih Porikli
Hao Su
3DPC
39
116
0
18 May 2023
VPGTrans: Transfer Visual Prompt Generator across LLMs
Ao Zhang
Hao Fei
Yuan Yao
Wei Ji
Li Li
Zhiyuan Liu
Tat-Seng Chua
MLLM
VLM
38
85
0
02 May 2023
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Noa Garcia
Yusuke Hirota
Yankun Wu
Yuta Nakashima
EGVM
43
51
0
06 Apr 2023
Prefix tuning for automated audio captioning
Minkyu Kim
Kim Sung-Bin
Tae-Hyun Oh
21
42
0
30 Mar 2023
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
29
15
0
29 Mar 2023
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang
Jiaming Han
Chris Liu
Peng Gao
Aojun Zhou
Xiangfei Hu
Shilin Yan
Pan Lu
Hongsheng Li
Yu Qiao
MLLM
74
745
0
28 Mar 2023
ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing
Zequn Zeng
Hao Zhang
Zhengjue Wang
Ruiying Lu
Dongsheng Wang
Bo Chen
BDL
DiffM
24
33
0
04 Mar 2023
The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges
Maria Lymperaiou
Giorgos Stamou
VLM
32
4
0
04 Mar 2023
Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning
Ivona Najdenkoska
Xiantong Zhen
M. Worring
VLM
26
18
0
28 Feb 2023
Teaching CLIP to Count to Ten
Roni Paiss
Ariel Ephrat
Omer Tov
Shiran Zada
Inbar Mosseri
Michal Irani
Tali Dekel
VLM
CLIP
39
92
0
23 Feb 2023
Prompt Stealing Attacks Against Text-to-Image Generation Models
Xinyue Shen
Y. Qu
Michael Backes
Yang Zhang
30
32
0
20 Feb 2023
Guiding Pretraining in Reinforcement Learning with Large Language Models
Yuqing Du
Olivia Watkins
Zihan Wang
Cédric Colas
Trevor Darrell
Pieter Abbeel
Abhishek Gupta
Jacob Andreas
LM&Ro
25
174
0
13 Feb 2023
Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions
Henrik Voigt
J. Hombeck
M. Meuschke
K. Lawonn
Sina Zarrieß
VLM
33
1
0
13 Feb 2023
Knowledge from Large-Scale Protein Contact Prediction Models Can Be Transferred to the Data-Scarce RNA Contact Prediction Task
Yiren Jian
Chongyang Gao
Chen Zeng
Yunjie Zhao
Soroush Vosoughi
27
0
0
13 Feb 2023
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning
Zhuolin Yang
Ming-Yu Liu
Zihan Liu
V. Korthikanti
Weili Nie
...
Yuke Zhu
M. Shoeybi
Bryan Catanzaro
Chaowei Xiao
Anima Anandkumar
VLM
RALM
34
39
0
09 Feb 2023
Prompting for Multimodal Hateful Meme Classification
Rui Cao
Roy Ka-Wei Lee
Wen-Haw Chong
Jing Jiang
VLM
25
75
0
08 Feb 2023
PAMI: partition input and aggregate outputs for model interpretation
Wei Shi
Wentao Zhang
Weishi Zheng
Ruixuan Wang
FAtt
26
3
0
07 Feb 2023
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
Kuniaki Saito
Kihyuk Sohn
Xiang Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
30
107
0
06 Feb 2023
Eliminating Contextual Prior Bias for Semantic Image Editing via Dual-Cycle Diffusion
Zuopeng Yang
Tianshu Chu
Xin Lin
Erdun Gao
Daqing Liu
J. Yang
Chaoyue Wang
DiffM
37
16
0
05 Feb 2023
IC3: Image Captioning by Committee Consensus
David M. Chan
Austin Myers
Sudheendra Vijayanarasimhan
David A. Ross
John F. Canny
32
17
0
02 Feb 2023
Discovering and Mitigating Visual Biases through Keyword Explanation
Younghyun Kim
Sangwoo Mo
Minkyu Kim
Kyungmin Lee
Jaeho Lee
Jinwoo Shin
40
33
0
26 Jan 2023
Joint Representation Learning for Text and 3D Point Cloud
Rui Huang
Xuran Pan
Henry Zheng
Haojun Jiang
Zhifeng Xie
S. Song
Gao Huang
36
19
0
18 Jan 2023