Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2205.01917
Cited By
v1
v2 (latest)
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 935 papers shown
Title
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Q. Garrido
Jean Ponce
Xinlei Chen
Michael G. Rabbat
Yann LeCun
Mahmoud Assran
Nicolas Ballas
MDE
VLM
155
87
0
15 Feb 2024
ProtChatGPT: Towards Understanding Proteins with Large Language Models
Chao Wang
Hehe Fan
Ruijie Quan
Yi Yang
108
16
0
15 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
95
6
0
14 Feb 2024
Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision
Zhaoqing Wang
Xiaobo Xia
Ziye Chen
Xiao He
Yandong Guo
Biwei Huang
Tongliang Liu
VLM
98
13
0
14 Feb 2024
PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
Fei Deng
Qifei Wang
Wei Wei
Matthias Grundmann
Tingbo Hou
EGVM
82
21
0
13 Feb 2024
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
Michael Dorkenwald
Nimrod Barazani
Cees G. M. Snoek
Yuki M. Asano
VLM
MLLM
59
12
0
13 Feb 2024
Towards a Foundation Model for Brain Age Prediction using coVariance Neural Networks
Saurabh Sihag
Gonzalo Mateos
Alejandro Ribeiro
70
6
0
12 Feb 2024
An Empirical Study Into What Matters for Calibrating Vision-Language Models
Weijie Tu
Weijian Deng
Dylan Campbell
Stephen Gould
Tom Gedeon
VLM
88
8
0
12 Feb 2024
Distilling Symbolic Priors for Concept Learning into Neural Networks
Ioana Marinescu
R. Thomas McCoy
Thomas Griffiths
75
2
0
10 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
92
12
0
10 Feb 2024
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
130
6
0
08 Feb 2024
LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors
Sheng Jin
Xue-Qiu Jiang
Jiaxing Huang
Lewei Lu
Shijian Lu
VLM
ObjD
91
26
0
07 Feb 2024
Progress and Opportunities of Foundation Models in Bioinformatics
Qing Li
Zhihang Hu
Yixuan Wang
Lei Li
Yimin Fan
Irwin King
Le Song
Yu Li
AI4CE
85
18
0
06 Feb 2024
Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap
Christopher Liao
Christian So
Theodoros Tsiligkaridis
Brian Kulis
86
0
0
06 Feb 2024
Image-Caption Encoding for Improving Zero-Shot Generalization
Eric Yang Yu
Christopher Liao
Sathvik Ravi
Theodoros Tsiligkaridis
Brian Kulis
OODD
VLM
49
0
0
05 Feb 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
80
3
0
31 Jan 2024
A Survey on Data Augmentation in Large Model Era
Yue Zhou
Chenlu Guo
Xu Wang
Yi-Ju Chang
Yuan Wu
LM&MA
VLM
128
27
0
27 Jan 2024
Segment Any Cell: A SAM-based Auto-prompting Fine-tuning Framework for Nuclei Segmentation
Saiyang Na
Yuzhi Guo
Feng Jiang
Hehuan Ma
Junzhou Huang
VLM
MedIm
86
16
0
24 Jan 2024
On the Efficacy of Text-Based Input Modalities for Action Anticipation
Apoorva Beedu
Karan Samel
Irfan Essa
102
2
0
23 Jan 2024
Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?
Cheng Han
Qifan Wang
Yiming Cui
Wenguan Wang
Lifu Huang
Siyuan Qi
Dongfang Liu
VLM
155
22
0
23 Jan 2024
CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
Zhihong Chen
Maya Varma
Jean-Benoit Delbrouck
Magdalini Paschali
Louis Blankemeier
...
Cameron Olsen
Tanishq Mathew Abraham
S. Gatidis
Akshay S. Chaudhari
Curtis P. Langlotz
MedIm
LM&MA
61
22
0
22 Jan 2024
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Katherine Crowson
Stefan Andreas Baumann
Alex Birch
Tanishq Mathew Abraham
Daniel Z. Kaplan
Enrico Shippole
101
55
0
21 Jan 2024
Exploring scalable medical image encoders beyond text supervision
Fernando Pérez-García
Harshita Sharma
Sam Bond-Taylor
Kenza Bouzid
Valentina Salvatelli
...
Maria T. A. Wetscherek
Noel C. F. Codella
Stephanie L. Hyland
Javier Alvarez-Valle
Ozan Oktay
LM&MA
MedIm
139
9
0
19 Jan 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
Xiaohu Jiang
Yixiao Ge
Yuying Ge
Dachuan Shi
Chun Yuan
Ying Shan
VLM
CLIP
87
9
0
18 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Hongsheng Li
Yu Qiao
Jifeng Dai
AuLLM
145
49
0
18 Jan 2024
GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition
Guangzhao Dai
Xiangbo Shu
Wenhao Wu
Rui Yan
Jiachao Zhang
VLM
108
7
0
18 Jan 2024
Improving fine-grained understanding in image-text pre-training
Ioana Bica
Anastasija Ilić
Matthias Bauer
Goker Erdogan
Matko Bovsnjak
...
A. Gritsenko
Matthias Minderer
Charles Blundell
Razvan Pascanu
Jovana Mitrović
VLM
75
27
0
18 Jan 2024
Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation
Ze-Long Cheng
Kehan Li
Hao Li
Peng Jin
Chang Liu
Xiawu Zheng
Rongrong Ji
Jie Chen
VOS
85
2
0
18 Jan 2024
Scalable Pre-training of Large Autoregressive Image Models
Alaaeldin El-Nouby
Michal Klein
Shuangfei Zhai
Miguel Angel Bautista
Alexander Toshev
Vaishaal Shankar
J. Susskind
Armand Joulin
VLM
105
80
0
16 Jan 2024
Concept-Guided Prompt Learning for Generalization in Vision-Language Models
Yi Zhang
Ce Zhang
Ke Yu
Yushun Tang
Zhihai He
VLM
MLLM
92
24
0
15 Jan 2024
Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model
Taehee Kim
Yeongjae Cho
Heejun Shin
Yohan Jo
Dongmyung Shin
103
4
0
12 Jan 2024
Distilling Vision-Language Models on Millions of Videos
Yue Zhao
Long Zhao
Xingyi Zhou
Jialin Wu
Chun-Te Chu
...
Hartwig Adam
Ting Liu
Boqing Gong
Philipp Krahenbuhl
Liangzhe Yuan
VLM
91
14
0
11 Jan 2024
Evaluating Data Augmentation Techniques for Coffee Leaf Disease Classification
Adrian Gheorghiu
Iulian-Marius Taiatu
Dumitru-Clementin Cercel
Iuliana Marin
Florin-Catalin Pop
79
2
0
11 Jan 2024
Learning to Prompt with Text Only Supervision for Vision-Language Models
Muhammad Uzair Khattak
Muhammad Ferjad Naeem
Muzammal Naseer
Luc Van Gool
F. Tombari
VLM
VPVLM
94
22
0
04 Jan 2024
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma
Furong Xu
Jian Liu
Ming Yang
Qingpei Guo
VLM
77
3
0
04 Jan 2024
Data-Centric Foundation Models in Computational Healthcare: A Survey
Yunkun Zhang
Jin Gao
Zheling Tan
Lingfeng Zhou
Kexin Ding
Mu Zhou
Shaoting Zhang
Dequan Wang
AI4CE
113
25
0
04 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
92
10
0
03 Jan 2024
Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition
Kyle Buettner
Sina Malakouti
Xiang Lorraine Li
Adriana Kovashka
128
3
0
03 Jan 2024
AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
Qiuhui Chen
Yi Hong
MedIm
120
2
0
02 Jan 2024
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Alex Jinpeng Wang
Linjie Li
Kevin Qinghong Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
VLM
VGen
99
12
0
01 Jan 2024
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Haoning Wu
Zicheng Zhang
Weixia Zhang
Chaofeng Chen
Liang Liao
...
Wenxiu Sun
Qiong Yan
Xiongkuo Min
Guangtao Zhai
Weisi Lin
85
163
0
28 Dec 2023
Prompt Expansion for Adaptive Text-to-Image Generation
Siddhartha Datta
Alexander Ku
Deepak Ramachandran
Peter Anderson
DiffM
68
10
0
27 Dec 2023
LeanVec: Searching vectors faster by making them fit
Mariano Tepper
Ishwar Bhati
Cecilia Aguerrebere
Mark Hildebrand
Ted Willke
VLM
OODD
75
2
0
26 Dec 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
106
18
0
25 Dec 2023
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Ehsan Abbasnejad
Hamed Damirchi
Ignacio M. Jara
Felipe Bravo-Marquez
Anton Van Den Hengel
VLM
66
1
0
22 Dec 2023
Leveraging Habitat Information for Fine-grained Bird Identification
Tin Nguyen
Peijie Chen
Anh Totti Nguyen
VLM
113
0
0
22 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
275
1,216
0
21 Dec 2023
LingoQA: Video Question Answering for Autonomous Driving
Ana-Maria Marcu
Long Chen
Jan Hünermann
Alice Karnsund
Benoît Hanotte
...
Vijay Badrinarayanan
Alex Kendall
Jamie Shotton
Elahe Arani
Oleg Sinavski
62
45
0
21 Dec 2023
Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast
Guangyin Bao
Qi Zhang
Duoqian Miao
Zixuan Gong
Liang Hu
Ke Liu
Yang Liu
Chongyang Shi
80
10
0
21 Dec 2023
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
Bingbing Wen
Zhengyuan Yang
Jianfeng Wang
Zhe Gan
Bill Howe
Lijuan Wang
MLLM
64
1
0
21 Dec 2023
Previous
1
2
3
...
7
8
9
...
17
18
19
Next