ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.01917
  4. Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models
v1v2 (latest)

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLMCLIPOffRL
ArXiv (abs)PDFHTML

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 935 papers shown
Title
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large
  Vision-Language Model for Remote Sensing
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Zilun Zhang
Tiancheng Zhao
Yulong Guo
Yuxiang Cai
DiffMVLM
146
67
0
20 Jun 2023
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
Fan Liu
Delong Chen
Zhan-Rong Guan
Xiaocong Zhou
Jiale Zhu
Qiaolin Ye
Liyong Fu
Jun Zhou
VLM
170
224
0
19 Jun 2023
LabelBench: A Comprehensive Framework for Benchmarking Adaptive
  Label-Efficient Learning
LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning
Jifan Zhang
Yifang Chen
Gregory H. Canal
Stephen Mussmann
Arnav M. Das
...
Yinglun Zhu
Jeffrey Bilmes
S. Du
Kevin Jamieson
Robert D. Nowak
VLM
100
12
0
16 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
Qingbin Liu
VLMCLIP
83
9
0
15 Jun 2023
Active Representation Learning for General Task Space with Applications
  in Robotics
Active Representation Learning for General Task Space with Applications in Robotics
Yifang Chen
Ying Huang
S. Du
Kevin Jamieson
Guanya Shi
SSL
73
3
0
15 Jun 2023
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to
  Enhance Visio-Linguistic Compositional Understanding
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
CoGeVLM
61
13
0
15 Jun 2023
Towards AGI in Computer Vision: Lessons Learned from GPT and Large
  Language Models
Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models
Lingxi Xie
Longhui Wei
Xiaopeng Zhang
Kaifeng Bi
Xiaotao Gu
Jianlong Chang
Qi Tian
86
7
0
14 Jun 2023
MOFI: Learning Image Representations from Noisy Entity Annotated Images
MOFI: Learning Image Representations from Noisy Entity Annotated Images
Wentao Wu
Aleksei Timofeev
Chen Chen
Bowen Zhang
Kun Duan
...
Yantao Zheng
Jonathon Shlens
Xianzhi Du
Zhe Gan
Yinfei Yang
VLM
87
8
0
13 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLMCLIP
112
60
0
13 Jun 2023
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for
  Histopathology Images
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Ming Y. Lu
Bowen Chen
Andrew Zhang
Drew F. K. Williamson
Richard J. Chen
Tong Ding
L. Le
Yung-Sung Chuang
Faisal Mahmood
VLMMedIm
208
102
0
13 Jun 2023
Towards a Machine-Learned Poisson Solver for Low-Temperature Plasma
  Simulations in Complex Geometries
Towards a Machine-Learned Poisson Solver for Low-Temperature Plasma Simulations in Complex Geometries
Ihda Chaerony Siffa
M. Becker
K. Weltmann
J. Trieschmann
62
2
0
13 Jun 2023
A Survey of Vision-Language Pre-training from the Lens of Multimodal
  Machine Translation
A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation
Jeremy Gwinnup
Kevin Duh
VLM
55
3
0
12 Jun 2023
Retrieval-Enhanced Contrastive Vision-Text Models
Retrieval-Enhanced Contrastive Vision-Text Models
Ahmet Iscen
Mathilde Caron
Alireza Fathi
Cordelia Schmid
CLIPVLM
111
28
0
12 Jun 2023
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
Fuxiao Liu
Hao Tan
Chris Tensmeyer
CLIPVLM
99
18
0
09 Jun 2023
UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot
  Vision-Language Tasks
UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks
Yanan Sun
Zi-Qi Zhong
Qi Fan
Chi-Keung Tang
Yu-Wing Tai
VLM
68
4
0
07 Jun 2023
Security Knowledge-Guided Fuzzing of Deep Learning Libraries
Security Knowledge-Guided Fuzzing of Deep Learning Libraries
Nima Shiri Harzevili
Mohammad Mahdi Mohajer
Moshi Wei
H. Pham
Song Wang
AAMLAI4CE
49
1
0
05 Jun 2023
Brain Diffusion for Visual Exploration: Cortical Discovery using Large
  Scale Generative Models
Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models
Andrew F. Luo
Margaret M. Henderson
Leila Wehbe
Michael J. Tarr
DiffM
94
22
0
05 Jun 2023
Revisiting Data-Free Knowledge Distillation with Poisoned Teachers
Revisiting Data-Free Knowledge Distillation with Poisoned Teachers
Junyuan Hong
Yi Zeng
Shuyang Yu
Lingjuan Lyu
R. Jia
Jiayu Zhou
AAML
55
9
0
04 Jun 2023
Revisiting the Role of Language Priors in Vision-Language Models
Revisiting the Role of Language Priors in Vision-Language Models
Zhiqiu Lin
Xinyue Chen
Deepak Pathak
Pengchuan Zhang
Deva Ramanan
VLM
159
27
0
02 Jun 2023
Towards In-context Scene Understanding
Towards In-context Scene Understanding
Ivana Balazevic
David Steiner
Nikhil Parthasarathy
Relja Arandjelović
Olivier J. Hénaff
76
31
0
02 Jun 2023
Vocabulary-free Image Classification
Vocabulary-free Image Classification
Alessandro Conti
Enrico Fini
Massimiliano Mancini
Paolo Rota
Yiming Wang
Elisa Ricci
VLM
129
27
0
01 Jun 2023
ManagerTower: Aggregating the Insights of Uni-Modal Experts for
  Vision-Language Representation Learning
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Xiao Xu
Bei Li
Chenfei Wu
Shao-Yen Tseng
Anahita Bhiwandiwalla
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
AIFinVLM
70
4
0
31 May 2023
Improving CLIP Training with Language Rewrites
Improving CLIP Training with Language Rewrites
Lijie Fan
Dilip Krishnan
Phillip Isola
Dina Katabi
Yonglong Tian
BDLVLMCLIP
109
177
0
31 May 2023
Too Large; Data Reduction for Vision-Language Pre-Training
Too Large; Data Reduction for Vision-Language Pre-Training
Alex Jinpeng Wang
Kevin Qinghong Lin
David Junhao Zhang
Stan Weixian Lei
Mike Zheng Shou
VLM
76
24
0
31 May 2023
Learning without Forgetting for Vision-Language Models
Learning without Forgetting for Vision-Language Models
Da-Wei Zhou
Yuanhan Zhang
Jingyi Ning
Jingyi Ning
De-Chuan Zhan
De-Chuan Zhan
Ziwei Liu
VLMCLL
140
44
0
30 May 2023
LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and
  Unlabeled Image Collections
LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
M. Jehanzeb Mirza
Leonid Karlinsky
Wei Lin
Mateusz Koziñski
Horst Possegger
Rogerio Feris
Horst Bischof
VLM
107
33
0
29 May 2023
Gaussian Process Probes (GPP) for Uncertainty-Aware Probing
Gaussian Process Probes (GPP) for Uncertainty-Aware Probing
Zehao Wang
Alexander Ku
Jason Baldridge
Thomas Griffiths
Been Kim
UQCV
89
13
0
29 May 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and
  Dataset
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
Qingbin Liu
195
112
0
29 May 2023
Deeply Coupled Cross-Modal Prompt Learning
Deeply Coupled Cross-Modal Prompt Learning
Xuejing Liu
Wei Tang
Jinghui Lu
Rui Zhao
Zhaojun Guo
Fei Tan
VLM
61
17
0
29 May 2023
KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature
  Adaptation of Vision-Language Models
KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models
Zhiwei Jia
P. Narayana
Arjun Reddy Akula
G. Pruthi
Haoran Su
Sugato Basu
Varun Jampani
VLMOffRL
81
4
0
28 May 2023
Learning from Children: Improving Image-Caption Pretraining via
  Curriculum
Learning from Children: Improving Image-Caption Pretraining via Curriculum
Hammad A. Ayyubi
R. Lokesh
Alireza Zareian
Bohong Wu
Shih-Fu Chang
VLMCLIP
62
2
0
27 May 2023
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating
  Vision-Language Transformers
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Dachuan Shi
Chaofan Tao
Anyi Rao
Zhendong Yang
Chun Yuan
Jiaqi Wang
VLM
133
23
0
27 May 2023
Matrix Information Theory for Self-Supervised Learning
Matrix Information Theory for Self-Supervised Learning
Yifan Zhang
Zhi-Hao Tan
Jingqin Yang
Weiran Huang
Yang Yuan
SSL
119
19
0
27 May 2023
Generating Images with Multimodal Language Models
Generating Images with Multimodal Language Models
Jing Yu Koh
Daniel Fried
Ruslan Salakhutdinov
MLLM
162
259
0
26 May 2023
Manifold Regularization for Memory-Efficient Training of Deep Neural
  Networks
Manifold Regularization for Memory-Efficient Training of Deep Neural Networks
Shadi Sartipi
Edgar A. Bernal
50
0
0
26 May 2023
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
Jannik Kossen
Mark Collier
Basil Mustafa
Tianlin Li
Xiaohua Zhai
Lucas Beyer
Andreas Steiner
Jesse Berent
Rodolphe Jenatton
Efi Kokiopoulou
VLM
58
13
0
26 May 2023
Discovering Novel Actions from Open World Egocentric Videos with
  Object-Grounded Visual Commonsense Reasoning
Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
Sanjoy Kundu
Shubham Trehan
Sathyanarayanan N. Aakur
LRMLM&Ro
71
3
0
26 May 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan O. Arik
Yihe Dong
Tomas Pfister
57
4
0
26 May 2023
Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical
  Invariance
Image as First-Order Norm+Linear Autoregression: Unveiling Mathematical Invariance
Yinpeng Chen
Xiyang Dai
Dongdong Chen
Mengchen Liu
Lu Yuan
Zicheng Liu
Youzuo Lin
101
2
0
25 May 2023
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Chia-Wen Kuo
Z. Kira
83
23
0
25 May 2023
Breaking the Curse of Quality Saturation with User-Centric Ranking
Breaking the Curse of Quality Saturation with User-Centric Ranking
Zhuokai Zhao
Yang Yang
Wenyu Wang
Chi-Yu Liu
Yunluo Shi
Wenjie Hu
Haotian Zhang
Shuangjun Yang
68
3
0
24 May 2023
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and
  Compositional Experts
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
Yunshui Li
Binyuan Hui
Zhichao Yin
Min Yang
Fei Huang
Yongbin Li
MoE
87
21
0
24 May 2023
Exploring Diverse In-Context Configurations for Image Captioning
Exploring Diverse In-Context Configurations for Image Captioning
Xu Yang
Yongliang Wu
Mingzhuo Yang
Haokun Chen
Xin Geng
MLLM
87
64
0
24 May 2023
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
Haoxuan You
Rui Sun
Zhecan Wang
Long Chen
Gengyu Wang
Hammad A. Ayyubi
Kai-Wei Chang
Shih-Fu Chang
VLMMLLMLRM
148
44
0
24 May 2023
VIP5: Towards Multimodal Foundation Models for Recommendation
VIP5: Towards Multimodal Foundation Models for Recommendation
Shijie Geng
Juntao Tan
Shuchang Liu
Zuohui Fu
Yongfeng Zhang
128
78
0
23 May 2023
S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist
  Captions
S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
Sangwoo Mo
Minkyu Kim
Kyungmin Lee
Jinwoo Shin
VLMCLIP
128
25
0
23 May 2023
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained
  Vision-Language Model
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
CLIPVLM
129
27
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and
  Blending
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLMCLIP
102
18
0
22 May 2023
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Ibrahim Alabdulmohsin
Xiaohua Zhai
Alexander Kolesnikov
Lucas Beyer
VLM
152
64
0
22 May 2023
Album Storytelling with Iterative Story-aware Captioning and Large
  Language Models
Album Storytelling with Iterative Story-aware Captioning and Large Language Models
Munan Ning
Yujia Xie
Dongdong Chen
Zeyin Song
Lu Yuan
Yonghong Tian
QiXiang Ye
Liuliang Yuan
71
8
0
22 May 2023
Previous
123...121314...171819
Next