ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.01917
  4. Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models
v1v2 (latest)

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLMCLIPOffRL
ArXiv (abs)PDFHTML

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 935 papers shown
Title
LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts
Anh-Quan Cao
M. Jaritz
Matthieu Guillaumin
Raoul de Charette
Loris Bazzani
VLMCLIP
105
2
0
10 Oct 2024
Evaluating Computational Pathology Foundation Models for Prostate Cancer
  Grading under Distribution Shifts
Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts
Fredrik K. Gustafsson
Mattias Rantalainen
OODMedIm
83
1
0
09 Oct 2024
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Lianyu Hu
Tongkai Shi
Wei Feng
Fanhua Shang
Liang Wan
VLM
128
2
0
09 Oct 2024
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation
  Models
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models
Rabin Adhikari
Safal Thapaliya
Manish Dhakal
Bishesh Khanal
MLLMVLM
77
0
0
07 Oct 2024
Uncertainty-Guided Enhancement on Driving Perception System via
  Foundation Models
Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models
Yunhao Yang
Yuxin Hu
Mao Ye
Zaiwei Zhang
Zhichao Lu
Yi Xu
Ufuk Topcu
Ben Snyder
85
2
0
02 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
175
0
0
01 Oct 2024
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
Kun Yuan
V. Srivastav
Nassir Navab
N. Padoy
122
9
0
30 Sep 2024
FAST: A Dual-tier Few-Shot Learning Paradigm for Whole Slide Image
  Classification
FAST: A Dual-tier Few-Shot Learning Paradigm for Whole Slide Image Classification
Kexue Fu
Xiaoyuan Luo
Linhao Qu
Shuo Wang
Ying Xiong
Ilias Maglogiannis
Longxiang Gao
Manning Wang
60
2
0
29 Sep 2024
Vision-Language Models are Strong Noisy Label Detectors
Vision-Language Models are Strong Noisy Label Detectors
Tong Wei
Haoyang Li
Chun-Shu Li
Jiang-Xin Shi
Yu-Feng Li
Min-Ling Zhang
VLM
76
9
0
29 Sep 2024
From Vision to Audio and Beyond: A Unified Model for Audio-Visual
  Representation and Generation
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
Kun Su
Xiulong Liu
Eli Shlizerman
VGen
163
7
0
27 Sep 2024
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation
  of Visual-knowledge-linguistic Features
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features
Xin Wei
Yaling Tao
Changde Du
Gangming Zhao
Yizhou Yu
Jinpeng Li
95
0
0
24 Sep 2024
LARE: Latent Augmentation using Regional Embedding with Vision-Language
  Model
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
Kosuke Sakurai
Tatsuya Ishii
Ryotaro Shimizu
Linxin Song
Masayuki Goto
VLM
76
0
0
19 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal
  Reasoning with Large Language Models
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
148
2
0
19 Sep 2024
MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion
MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion
Kalakonda Sai Shashank
Shubh Maheshwari
Ravi Kiran Sarvadevabhatla
VGenDiffM
105
3
0
18 Sep 2024
Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval
Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval
Amirreza Mahbod
Nematollah Saeidi
Sepideh Hatamikia
Ramona Woitek
VLMMedIm
126
4
0
14 Sep 2024
Phikon-v2, A large and public feature extractor for biomarker prediction
Phikon-v2, A large and public feature extractor for biomarker prediction
Alexandre Filiot
Paul Jacob
Alice Mac Kain
Charlie Saillard
MedIm
87
21
0
13 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGeVLM
56
0
0
12 Sep 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP
  Perspective
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Zehan Zhu
Lin Gui
Ruichu Cai
Erik Cambria
Hasti Seifi
105
6
0
11 Sep 2024
CanvOI, an Oncology Intelligence Foundation Model: Scaling FLOPS
  Differently
CanvOI, an Oncology Intelligence Foundation Model: Scaling FLOPS Differently
Jonathan Zalach
Inbal Gazy
Assaf Avinoam
Ron Sinai
Eran Shmuel
Inbar Gilboa
Christine Swisher
Naim Matasci
Reva Basho
David B. Agus
76
0
0
04 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
459
1
0
04 Sep 2024
Rethinking Sparse Lexical Representations for Image Retrieval in the Age
  of Rising Multi-Modal Large Language Models
Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
K. Nakata
Daisuke Miyashita
Youyang Ng
Yasuto Hoshi
J. Deguchi
64
0
0
29 Aug 2024
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
Junyao Ge
Xu Zhang
Yang Zheng
Kaitai Guo
Jimin Liang
171
2
0
27 Aug 2024
A New Era in Computational Pathology: A Survey on Foundation and
  Vision-Language Models
A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models
Dibaloke Chanda
Milan Aryal
Nasim Yahya Soltani
Masoud Ganji
AI4CEVLM
133
7
0
23 Aug 2024
Has Multimodal Learning Delivered Universal Intelligence in Healthcare?
  A Comprehensive Survey
Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Qika Lin
Yifan Zhu
Xin Mei
Ling Huang
Jingying Ma
Kai He
Zhen Peng
Min Zhang
Mengling Feng
109
23
0
23 Aug 2024
XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary
  Classification of Chest X-Rays
XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary Classification of Chest X-Rays
Umaima Rahman
Abhishek Basu
Muhammad Uzair Khattak
Aniq Ur Rahman
MedIm
66
0
0
21 Aug 2024
WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared
  Person Re-Identification
WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification
Yonggan Wu
Ling-Chao Meng
Yuan Zichao
Sixian Chan
Hong-Qiang Wang
103
4
0
20 Aug 2024
C${^2}$RL: Content and Context Representation Learning for Gloss-free
  Sign Language Translation and Retrieval
C2{^2}2RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval
Zhigang Chen
Benjia Zhou
Yiqing Huang
Jun Wan
Yibo Hu
Hailin Shi
Yanyan Liang
Zhen Lei
Du Zhang
VLMSLR
68
3
0
19 Aug 2024
NAVERO: Unlocking Fine-Grained Semantics for Video-Language
  Compositionality
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality
Chaofan Tao
Gukyeong Kwon
Varad Gunjal
Hao Yang
Zhaowei Cai
Yonatan Dukler
Ashwin Swaminathan
R. Manmatha
Colin Jon Taylor
Stefano Soatto
CoGe
61
0
0
18 Aug 2024
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
Sayna Ebrahimi
Sercan O. Arik
Tejas Nama
Tomas Pfister
74
1
0
13 Aug 2024
Contrastive masked auto-encoders based self-supervised hashing for 2D
  image and 3D point cloud cross-modal retrieval
Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval
Rukai Wei
Heng Cui
Yu Liu
Yufeng Hou
Yanzhao Xie
Ke Zhou
3DPC
47
0
0
11 Aug 2024
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic
  Segmentation
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
Dahyun Kang
Minsu Cho
ObjDVLM
140
11
0
09 Aug 2024
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond
  Scaling
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
Haider Al-Tahan
Q. Garrido
Randall Balestriero
Diane Bouchacourt
C. Hazirbas
Mark Ibrahim
VLM
135
10
0
09 Aug 2024
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language
  Modeling
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
William Y. Zhu
Keren Ye
Junjie Ke
Jiahui Yu
Leonidas Guibas
P. Milanfar
Feng Yang
98
2
0
07 Aug 2024
Multistain Pretraining for Slide Representation Learning in Pathology
Multistain Pretraining for Slide Representation Learning in Pathology
Guillaume Jaume
Anurag J. Vaidya
Andrew Zhang
Andrew H. Song
Richard J. Chen
S. Sahai
Dandan Mo
Emilio Madrigal
L. Le
Faisal Mahmood
109
14
0
05 Aug 2024
Text-Guided Video Masked Autoencoder
Text-Guided Video Masked Autoencoder
D. Fan
Jue Wang
Shuai Liao
Zhikang Zhang
Vimal Bhat
Xinyu Li
VGen
57
3
0
01 Aug 2024
Conditioned Prompt-Optimization for Continual Deepfake Detection
Conditioned Prompt-Optimization for Continual Deepfake Detection
Francesco Laiti
Benedetta Liberatori
Thomas De Min
Elisa Ricci
116
3
0
31 Jul 2024
GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language
  Models
GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models
Ali Abdollahi
Mahdi Ghaznavi
Mohammad Reza Karimi Nejad
Arash Mari Oriyad
Reza Abbasi
Ali Salesi
Melika Behjati
M. Rohban
M. Baghshah
CoGe
137
1
0
30 Jul 2024
MMTrail: A Multimodal Trailer Video Dataset with Language and Music
  Descriptions
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Xiaowei Chi
Yatian Wang
Aosong Cheng
Pengjun Fang
Zeyue Tian
...
Wenhan Luo
Qifeng Chen
Shanghang Zhang
Qi-fei Liu
Yi-Ting Guo
129
7
0
30 Jul 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
90
2
0
28 Jul 2024
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Biao Wu
Yutong Xie
Zeyu Zhang
Minh Hieu Phan
Qi Chen
Ling-Hao Chen
Qi Wu
LM&MA
99
0
0
28 Jul 2024
Unified Lexical Representation for Interpretable Visual-Language
  Alignment
Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li
Yikai Wang
Yanwei Fu
Dongyu Ru
Zheng Zhang
Tong He
VLM
59
4
0
25 Jul 2024
QPT V2: Masked Image Modeling Advances Visual Scoring
QPT V2: Masked Image Modeling Advances Visual Scoring
Qizhi Xie
Kun Yuan
Yunpeng Qu
Mingda Wu
Ming Sun
Chao Zhou
Jihong Zhu
83
3
0
23 Jul 2024
Improved Few-Shot Image Classification Through Multiple-Choice Questions
Improved Few-Shot Image Classification Through Multiple-Choice Questions
Dipika Khullar
Emmett Goodman
Negin Sokhandan
54
0
0
23 Jul 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with
  Extensive Diversity
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLMMLLM
112
29
0
22 Jul 2024
Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language
  Encoders
Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders
Laura Niss
Kevin Vogt-Lowell
Theodoros Tsiligkaridis
VLM
95
1
0
22 Jul 2024
In-Context Learning Improves Compositional Understanding of
  Vision-Language Models
In-Context Learning Improves Compositional Understanding of Vision-Language Models
Matteo Nulli
Anesa Ibrahimi
Avik Pal
Hoshe Lee
Ivona Najdenkoska
VLMCoGe
75
0
0
22 Jul 2024
A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model
A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model
Yingxue Xu
Yihui Wang
Fengtao Zhou
Jiabo Ma
Shu Yang
...
Anjia Han
Ronald Cheong Kin Chan
Li Liang
Xiuming Zhang
Hao Chen
128
22
0
22 Jul 2024
Multimodal Label Relevance Ranking via Reinforcement Learning
Multimodal Label Relevance Ranking via Reinforcement Learning
Taian Guo
Taolin Zhang
Haoqian Wu
Hanjun Li
Ruizhi Qiao
Xing Sun
OffRL
42
0
0
18 Jul 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOSLRM
178
5
0
18 Jul 2024
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language
  Inference
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Mengcheng Lan
Chaofeng Chen
Yiping Ke
Xinjiang Wang
Xue Jiang
Wayne Zhang
VLM
117
29
0
17 Jul 2024
Previous
12345...171819
Next