Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2205.01917
Cited By
v1
v2 (latest)
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 935 papers shown
Title
Large Scale Masked Autoencoding for Reducing Label Requirements on SAR Data
Matt Allen
Francisco Dorr
Joseph A. Gallego-Mejia
Laura Martínez-Ferrer
Anna Jungbluth
F. Kalaitzis
Raúl Ramos-Pollán
83
9
0
02 Oct 2023
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Tianyu Yu
Jinyi Hu
Yuan Yao
Haoye Zhang
Yue Zhao
...
Jiao Xue
Dahai Li
Zhiyuan Liu
Hai-Tao Zheng
Maosong Sun
VLM
MLLM
45
20
0
01 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
LRM
MLLM
113
22
0
01 Oct 2023
Knowledge Engineering using Large Language Models
Bradley Paul Allen
Lise Stork
Paul T. Groth
91
25
0
01 Oct 2023
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists
Yulu Gan
Sungwoo Park
Alexander Schubert
Anthony Philippakis
Ahmed Alaa
VLM
109
25
0
30 Sep 2023
Training a Large Video Model on a Single Machine in a Day
Yue Zhao
Philipp Krahenbuhl
VLM
94
17
0
28 Sep 2023
AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
Sanghwan Kim
Hao Tang
Fisher Yu
VLM
CLIP
69
5
0
28 Sep 2023
Parameter-Saving Adversarial Training: Reinforcing Multi-Perturbation Robustness via Hypernetworks
Huihui Gong
Minjing Dong
Siqi Ma
S. Çamtepe
Surya Nepal
Chang Xu
AAML
OOD
53
1
0
28 Sep 2023
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Ruyang Liu
Chen Li
Yixiao Ge
Ying Shan
Thomas H. Li
Ge Li
116
30
0
27 Sep 2023
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
Hila Levi
Guy Heller
Dan Levi
Ethan Fetaya
OCL
VLM
69
4
0
26 Sep 2023
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
R. S. Srinivasa
Jaejin Cho
Chouchang Yang
Yashas Malur Saidutta
Ching Hua Lee
Yilin Shen
Hongxia Jin
VLM
63
10
0
26 Sep 2023
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
Z. Yao
Xiaoxia Wu
Conglong Li
Minjia Zhang
Heyang Qi
Olatunji Ruwase
A. A. Awan
Samyam Rajbhandari
Yuxiong He
93
11
0
25 Sep 2023
Beyond Grids: Exploring Elastic Input Sampling for Vision Transformers
Adam Pardyl
Grzegorz Kurzejamski
Jan Olszewski
Tomasz Trzciñski
Bartosz Zieliñski
56
1
0
23 Sep 2023
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
Kan Wu
Houwen Peng
Zhenghong Zhou
Bin Xiao
Mengchen Liu
...
Xi
Xi Chen
Xinggang Wang
Hongyang Chao
Han Hu
VLM
OODD
86
64
0
21 Sep 2023
ContextRef: Evaluating Referenceless Metrics For Image Description Generation
Elisa Kreiss
E. Zelikman
Christopher Potts
Nick Haber
76
5
0
21 Sep 2023
DreamLLM: Synergistic Multimodal Comprehension and Creation
Runpei Dong
Chunrui Han
Yuang Peng
Zekun Qi
Zheng Ge
...
Hao-Ran Wei
Xiangwen Kong
Xiangyu Zhang
Kaisheng Ma
Li Yi
MLLM
109
199
0
20 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
77
4
0
16 Sep 2023
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning
Zhiwu Qing
Shiwei Zhang
Ziyuan Huang
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
78
20
0
14 Sep 2023
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Danae Sánchez Villegas
Daniel Preoctiuc-Pietro
Nikolaos Aletras
64
3
0
14 Sep 2023
PRE: Vision-Language Prompt Learning with Reparameterization Encoder
Anh Pham Thi Minh
An Duc Nguyen
Georgios Tzimiropoulos
VPVLM
VLM
85
3
0
14 Sep 2023
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Haoqin Tu
Bingchen Zhao
Chen Wei
Cihang Xie
MLLM
69
15
0
13 Sep 2023
Language Models as Black-Box Optimizers for Vision-Language Models
Shihong Liu
Zhiqiu Lin
Samuel Yu
Ryan Lee
Tiffany Ling
Deepak Pathak
Deva Ramanan
VLM
126
30
0
12 Sep 2023
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
Yang Jin
Kun Xu
Kun Xu
Liwei Chen
Chao Liao
...
Xiaoqiang Lei
Di Zhang
Wenwu Ou
Kun Gai
Yadong Mu
MLLM
VLM
79
50
0
09 Sep 2023
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Zigang Geng
Binxin Yang
Tiankai Hang
Chen Li
Shuyang Gu
...
Jianmin Bao
Zheng Zhang
Han Hu
DongDong Chen
Baining Guo
DiffM
VLM
118
107
0
07 Sep 2023
Dual Relation Alignment for Composed Image Retrieval
Xintong Jiang
Yaxiong Wang
Yujiao Wu
Ming Wang
Xueming Qian
50
6
0
05 Sep 2023
ExMobileViT: Lightweight Classifier Extension for Mobile Vision Transformer
Gyeongdong Yang
Yungwook Kwon
Hyunjin Kim
ViT
32
2
0
04 Sep 2023
BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning
Yi Zhang
Ce Zhang
Zihan Liao
Yushun Tang
Zhihai He
BDL
VLM
106
10
0
03 Sep 2023
Contrastive Feature Masking Open-Vocabulary Vision Transformer
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD
VLM
120
27
0
02 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Zhiyong Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
96
8
0
31 Aug 2023
Do the Frankenstein, or how to achieve better out-of-distribution performance with manifold mixing model soup
Hannes Fassold
MoMe
UQCV
43
2
0
28 Aug 2023
Fine-tuning can cripple your foundation model; preserving features may be the solution
Jishnu Mukhoti
Y. Gal
Philip Torr
P. Dokania
CLL
130
47
0
25 Aug 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM
VLM
ObjD
194
945
0
24 Aug 2023
DLIP: Distilling Language-Image Pre-training
Huafeng Kuang
Jie Wu
Xiawu Zheng
Ming Li
Xuefeng Xiao
Rui Wang
Min Zheng
Rongrong Ji
VLM
70
4
0
24 Aug 2023
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Jinyi Hu
Yuan Yao
Chong Wang
Shanonan Wang
Yinxu Pan
...
Yankai Lin
Jiao Xue
Dahai Li
Zhiyuan Liu
Maosong Sun
MLLM
VLM
108
55
0
23 Aug 2023
Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment
Kangmin Xu
Liang Liao
Jing Xiao
Chaofeng Chen
Haoning Wu
Qiong Yan
Weisi Lin
ViT
55
5
0
23 Aug 2023
Unsupervised Prototype Adapter for Vision-Language Models
Yi Zhang
Ce Zhang
Xue-mei Hu
Z. He
VLM
74
4
0
22 Aug 2023
GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
Xi Deng
Han Shi
Runhu Huang
Changlin Li
Hang Xu
Jianhua Han
James T. Kwok
Shen Zhao
Wei Zhang
Xiaodan Liang
CLIP
VLM
91
3
0
22 Aug 2023
Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models
Peiyan Zhang
Hao Liu
Chaozhuo Li
Xing Xie
Sunghun Kim
Haohan Wang
VLM
OOD
126
8
0
21 Aug 2023
Dataset Quantization
Daquan Zhou
Kaixin Wang
Jianyang Gu
Xiang Peng
Dongze Lian
Yifan Zhang
Yang You
Jiashi Feng
DD
89
41
0
21 Aug 2023
Generic Attention-model Explainability by Weighted Relevance Accumulation
Yiming Huang
Ao Jia
Xiaodan Zhang
Jiawei Zhang
46
1
0
20 Aug 2023
ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
Weixian Lei
Yixiao Ge
Jianfeng Zhang
Dylan Sun
Kun Yi
Ying Shan
Mike Zheng Shou
61
1
0
20 Aug 2023
TDG: Text-guided Domain Generalization
Geng-Ming Liu
Yuxi Wang
OOD
76
2
0
19 Aug 2023
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi
Jana Kosecka
VLM
111
12
0
18 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
74
1
0
18 Aug 2023
Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers
Tobias Christian Nauen
Sebastián M. Palacio
Federico Raue
Andreas Dengel
145
4
0
18 Aug 2023
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
Chaorui Deng
Qi Chen
Pengda Qin
Dave Zhenyu Chen
Qi Wu
VLM
CLIP
83
34
0
15 Aug 2023
CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation
Hongguang Zhu
Yunchao Wei
Xiaodan Liang
Chunjie Zhang
Yao-Min Zhao
VLM
72
30
0
14 Aug 2023
Orthogonal Temporal Interpolation for Zero-Shot Video Recognition
Yan Zhu
Junbao Zhuo
B. Ma
Jiajia Geng
Xiaoming Wei
Xiaolin K. Wei
Shuhui Wang
VLM
61
6
0
14 Aug 2023
Temporally-Adaptive Models for Efficient Video Understanding
Ziyuan Huang
Shiwei Zhang
Liang Pan
Zhiwu Qing
Yingya Zhang
Ziwei Liu
Marcelo H. Ang
78
10
0
10 Aug 2023
ViLP: Knowledge Exploration using Vision, Language, and Pose Embeddings for Video Action Recognition
S. Chaudhuri
Saumik Bhattacharya
80
3
0
07 Aug 2023
Previous
1
2
3
...
10
11
12
...
17
18
19
Next