ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.01917
  4. Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models
v1v2 (latest)

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLMCLIPOffRL
ArXiv (abs)PDFHTML

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 935 papers shown
Title
Revisiting Feature Prediction for Learning Visual Representations from
  Video
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Q. Garrido
Jean Ponce
Xinlei Chen
Michael G. Rabbat
Yann LeCun
Mahmoud Assran
Nicolas Ballas
MDEVLM
155
87
0
15 Feb 2024
ProtChatGPT: Towards Understanding Proteins with Large Language Models
ProtChatGPT: Towards Understanding Proteins with Large Language Models
Chao Wang
Hehe Fan
Ruijie Quan
Yi Yang
108
16
0
15 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual
  Recognition with Visual Modality Missing?
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
95
6
0
14 Feb 2024
Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision
Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision
Zhaoqing Wang
Xiaobo Xia
Ziye Chen
Xiao He
Yandong Guo
Biwei Huang
Tongliang Liu
VLM
98
13
0
14 Feb 2024
PRDP: Proximal Reward Difference Prediction for Large-Scale Reward
  Finetuning of Diffusion Models
PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
Fei Deng
Qifei Wang
Wei Wei
Matthias Grundmann
Tingbo Hou
EGVM
82
21
0
13 Feb 2024
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
Michael Dorkenwald
Nimrod Barazani
Cees G. M. Snoek
Yuki M. Asano
VLMMLLM
59
12
0
13 Feb 2024
Towards a Foundation Model for Brain Age Prediction using coVariance
  Neural Networks
Towards a Foundation Model for Brain Age Prediction using coVariance Neural Networks
Saurabh Sihag
Gonzalo Mateos
Alejandro Ribeiro
70
6
0
12 Feb 2024
An Empirical Study Into What Matters for Calibrating Vision-Language
  Models
An Empirical Study Into What Matters for Calibrating Vision-Language Models
Weijie Tu
Weijian Deng
Dylan Campbell
Stephen Gould
Tom Gedeon
VLM
88
8
0
12 Feb 2024
Distilling Symbolic Priors for Concept Learning into Neural Networks
Distilling Symbolic Priors for Concept Learning into Neural Networks
Ioana Marinescu
R. Thomas McCoy
Thomas Griffiths
75
2
0
10 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
92
12
0
10 Feb 2024
CIC: A Framework for Culturally-Aware Image Captioning
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
130
6
0
08 Feb 2024
LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained
  Descriptors
LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors
Sheng Jin
Xue-Qiu Jiang
Jiaxing Huang
Lewei Lu
Shijian Lu
VLMObjD
91
26
0
07 Feb 2024
Progress and Opportunities of Foundation Models in Bioinformatics
Progress and Opportunities of Foundation Models in Bioinformatics
Qing Li
Zhihang Hu
Yixuan Wang
Lei Li
Yimin Fan
Irwin King
Le Song
Yu Li
AI4CE
85
18
0
06 Feb 2024
Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Christopher Liao
Christian So
Theodoros Tsiligkaridis
Brian Kulis
86
0
0
06 Feb 2024
Image-Caption Encoding for Improving Zero-Shot Generalization
Image-Caption Encoding for Improving Zero-Shot Generalization
Eric Yang Yu
Christopher Liao
Sathvik Ravi
Theodoros Tsiligkaridis
Brian Kulis
OODDVLM
49
0
0
05 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based
  Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
80
3
0
31 Jan 2024
A Survey on Data Augmentation in Large Model Era
A Survey on Data Augmentation in Large Model Era
Yue Zhou
Chenlu Guo
Xu Wang
Yi-Ju Chang
Yuan Wu
LM&MAVLM
128
27
0
27 Jan 2024
Segment Any Cell: A SAM-based Auto-prompting Fine-tuning Framework for
  Nuclei Segmentation
Segment Any Cell: A SAM-based Auto-prompting Fine-tuning Framework for Nuclei Segmentation
Saiyang Na
Yuzhi Guo
Feng Jiang
Hehuan Ma
Junzhou Huang
VLMMedIm
86
16
0
24 Jan 2024
On the Efficacy of Text-Based Input Modalities for Action Anticipation
On the Efficacy of Text-Based Input Modalities for Action Anticipation
Apoorva Beedu
Karan Samel
Irfan Essa
102
2
0
23 Jan 2024
Facing the Elephant in the Room: Visual Prompt Tuning or Full
  Finetuning?
Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?
Cheng Han
Qifan Wang
Yiming Cui
Wenguan Wang
Lifu Huang
Siyuan Qi
Dongfang Liu
VLM
155
22
0
23 Jan 2024
CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
Zhihong Chen
Maya Varma
Jean-Benoit Delbrouck
Magdalini Paschali
Louis Blankemeier
...
Cameron Olsen
Tanishq Mathew Abraham
S. Gatidis
Akshay S. Chaudhari
Curtis P. Langlotz
MedImLM&MA
61
22
0
22 Jan 2024
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass
  Diffusion Transformers
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Katherine Crowson
Stefan Andreas Baumann
Alex Birch
Tanishq Mathew Abraham
Daniel Z. Kaplan
Enrico Shippole
101
55
0
21 Jan 2024
Exploring scalable medical image encoders beyond text supervision
Exploring scalable medical image encoders beyond text supervision
Fernando Pérez-García
Harshita Sharma
Sam Bond-Taylor
Kenza Bouzid
Valentina Salvatelli
...
Maria T. A. Wetscherek
Noel C. F. Codella
Stephanie L. Hyland
Javier Alvarez-Valle
Ozan Oktay
LM&MAMedIm
139
9
0
19 Jan 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
Supervised Fine-tuning in turn Improves Visual Foundation Models
Xiaohu Jiang
Yixiao Ge
Yuying Ge
Dachuan Shi
Chun Yuan
Ying Shan
VLMCLIP
87
9
0
18 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via
  Multi-modal Feature Synchronizer
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Hongsheng Li
Yu Qiao
Jifeng Dai
AuLLM
145
49
0
18 Jan 2024
GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot
  Egocentric Action Recognition
GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition
Guangzhao Dai
Xiangbo Shu
Wenhao Wu
Rui Yan
Jiachao Zhang
VLM
108
7
0
18 Jan 2024
Improving fine-grained understanding in image-text pre-training
Improving fine-grained understanding in image-text pre-training
Ioana Bica
Anastasija Ilić
Matthias Bauer
Goker Erdogan
Matko Bovsnjak
...
A. Gritsenko
Matthias Minderer
Charles Blundell
Razvan Pascanu
Jovana Mitrović
VLM
75
27
0
18 Jan 2024
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance
  Segmentation
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation
Ze-Long Cheng
Kehan Li
Hao Li
Peng Jin
Chang Liu
Xiawu Zheng
Rongrong Ji
Jie Chen
VOS
85
2
0
18 Jan 2024
Scalable Pre-training of Large Autoregressive Image Models
Scalable Pre-training of Large Autoregressive Image Models
Alaaeldin El-Nouby
Michal Klein
Shuangfei Zhai
Miguel Angel Bautista
Alexander Toshev
Vaishaal Shankar
J. Susskind
Armand Joulin
VLM
105
80
0
16 Jan 2024
Concept-Guided Prompt Learning for Generalization in Vision-Language
  Models
Concept-Guided Prompt Learning for Generalization in Vision-Language Models
Yi Zhang
Ce Zhang
Ke Yu
Yushun Tang
Zhihai He
VLMMLLM
92
24
0
15 Jan 2024
Generalizing Visual Question Answering from Synthetic to Human-Written
  Questions via a Chain of QA with a Large Language Model
Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model
Taehee Kim
Yeongjae Cho
Heejun Shin
Yohan Jo
Dongmyung Shin
103
4
0
12 Jan 2024
Distilling Vision-Language Models on Millions of Videos
Distilling Vision-Language Models on Millions of Videos
Yue Zhao
Long Zhao
Xingyi Zhou
Jialin Wu
Chun-Te Chu
...
Hartwig Adam
Ting Liu
Boqing Gong
Philipp Krahenbuhl
Liangzhe Yuan
VLM
91
14
0
11 Jan 2024
Evaluating Data Augmentation Techniques for Coffee Leaf Disease
  Classification
Evaluating Data Augmentation Techniques for Coffee Leaf Disease Classification
Adrian Gheorghiu
Iulian-Marius Taiatu
Dumitru-Clementin Cercel
Iuliana Marin
Florin-Catalin Pop
79
2
0
11 Jan 2024
Learning to Prompt with Text Only Supervision for Vision-Language Models
Learning to Prompt with Text Only Supervision for Vision-Language Models
Muhammad Uzair Khattak
Muhammad Ferjad Naeem
Muzammal Naseer
Luc Van Gool
F. Tombari
VLMVPVLM
94
22
0
04 Jan 2024
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for
  Multimodal Alignment
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma
Furong Xu
Jian Liu
Ming Yang
Qingpei Guo
VLM
77
3
0
04 Jan 2024
Data-Centric Foundation Models in Computational Healthcare: A Survey
Data-Centric Foundation Models in Computational Healthcare: A Survey
Yunkun Zhang
Jin Gao
Zheling Tan
Lingfeng Zhou
Kexin Ding
Mu Zhou
Shaoting Zhang
Dequan Wang
AI4CE
113
25
0
04 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as
  Programmers
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRMVLM
92
10
0
03 Jan 2024
Incorporating Geo-Diverse Knowledge into Prompting for Increased
  Geographical Robustness in Object Recognition
Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition
Kyle Buettner
Sina Malakouti
Xiang Lorraine Li
Adriana Kovashka
128
3
0
03 Jan 2024
AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
Qiuhui Chen
Yi Hong
MedIm
120
2
0
02 Jan 2024
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
  Pre-Training
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Alex Jinpeng Wang
Linjie Li
Kevin Qinghong Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
VLMVGen
99
12
0
01 Jan 2024
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined
  Levels
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Haoning Wu
Zicheng Zhang
Weixia Zhang
Chaofeng Chen
Liang Liao
...
Wenxiu Sun
Qiong Yan
Xiongkuo Min
Guangtao Zhai
Weisi Lin
85
163
0
28 Dec 2023
Prompt Expansion for Adaptive Text-to-Image Generation
Prompt Expansion for Adaptive Text-to-Image Generation
Siddhartha Datta
Alexander Ku
Deepak Ramachandran
Peter Anderson
DiffM
68
10
0
27 Dec 2023
LeanVec: Searching vectors faster by making them fit
LeanVec: Searching vectors faster by making them fit
Mariano Tepper
Ishwar Bhati
Cecilia Aguerrebere
Mark Hildebrand
Ted Willke
VLMOODD
75
2
0
26 Dec 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
106
18
0
25 Dec 2023
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies
  and Variances
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Ehsan Abbasnejad
Hamed Damirchi
Ignacio M. Jara
Felipe Bravo-Marquez
Anton Van Den Hengel
VLM
66
1
0
22 Dec 2023
Leveraging Habitat Information for Fine-grained Bird Identification
Leveraging Habitat Information for Fine-grained Bird Identification
Tin Nguyen
Peijie Chen
Anh Totti Nguyen
VLM
113
0
0
22 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic
  Visual-Linguistic Tasks
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLMMLLM
275
1,216
0
21 Dec 2023
LingoQA: Video Question Answering for Autonomous Driving
LingoQA: Video Question Answering for Autonomous Driving
Ana-Maria Marcu
Long Chen
Jan Hünermann
Alice Karnsund
Benoît Hanotte
...
Vijay Badrinarayanan
Alex Kendall
Jamie Shotton
Elahe Arani
Oleg Sinavski
62
45
0
21 Dec 2023
Multimodal Federated Learning with Missing Modality via Prototype Mask
  and Contrast
Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast
Guangyin Bao
Qi Zhang
Duoqian Miao
Zixuan Gong
Liang Hu
Ke Liu
Yang Liu
Chongyang Shi
80
10
0
21 Dec 2023
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large
  Multimodal and Language Models
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
Bingbing Wen
Zhengyuan Yang
Jianfeng Wang
Zhe Gan
Bill Howe
Lijuan Wang
MLLM
64
1
0
21 Dec 2023
Previous
123...789...171819
Next