ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.01917
  4. Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models
v1v2 (latest)

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLMCLIPOffRL
ArXiv (abs)PDFHTML

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 935 papers shown
Title
Segment Any 3D Object with Language
Segment Any 3D Object with Language
Seungjun Lee
Yuyang Zhao
Gim Hee Lee
79
1
0
02 Apr 2024
Iterated Learning Improves Compositionality in Large Vision-Language
  Models
Iterated Learning Improves Compositionality in Large Vision-Language Models
Chenhao Zheng
Jieyu Zhang
Aniruddha Kembhavi
Ranjay Krishna
VLMCoGe
88
12
0
02 Apr 2024
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Jienneg Chen
Qihang Yu
Xiaohui Shen
Alan Yuille
Liang-Chieh Chen
3DVVLM
101
29
0
02 Apr 2024
Fashion Style Editing with Generative Human Prior
Fashion Style Editing with Generative Human Prior
Chaerin Kong
Seungyong Lee
Soohyeok Im
Wonsuk Yang
101
0
0
02 Apr 2024
VLRM: Vision-Language Models act as Reward Models for Image Captioning
VLRM: Vision-Language Models act as Reward Models for Image Captioning
Maksim Dzabraev
Alexander Kunitsyn
Andrei Ivaniuta
VLMMLLM
73
3
0
02 Apr 2024
Streaming Dense Video Captioning
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
107
42
0
01 Apr 2024
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Agneet Chatterjee
Gabriela Ben-Melech Stan
Estelle Aflalo
Sayak Paul
Dhruba Ghosh
...
Ludwig Schmidt
Hanna Hajishirzi
Vasudev Lal
Chitta Baral
Yezhou Yang
EGVMVLM
116
18
0
01 Apr 2024
GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
Yunsong Wang
Hanlin Chen
Gim Hee Lee
124
6
0
01 Apr 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLMLRM
102
40
0
28 Mar 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
85
7
0
28 Mar 2024
LocCa: Visual Pretraining with Location-aware Captioners
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
148
7
0
28 Mar 2024
CLAP4CLIP: Continual Learning with Probabilistic Finetuning for
  Vision-Language Models
CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models
Saurav Jha
Dong Gong
Lina Yao
CLIPVLM
152
11
0
28 Mar 2024
Language Plays a Pivotal Role in the Object-Attribute Compositional
  Generalization of CLIP
Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP
Reza Abbasi
Mohammad Samiei
M. Rohban
M. Baghshah
VLMCoGe
62
0
0
27 Mar 2024
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering
  Using a VLM
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Wonkyun Kim
Changin Choi
Wonseok Lee
Wonjong Rhee
VLM
108
56
0
27 Mar 2024
Residual-based Language Models are Free Boosters for Biomedical Imaging
Residual-based Language Models are Free Boosters for Biomedical Imaging
Zhixin Lai
Jing Wu
Suiyao Chen
Yucheng Zhou
N. Hovakimyan
MedIm
94
31
0
26 Mar 2024
DreamLIP: Language-Image Pre-training with Long Captions
DreamLIP: Language-Image Pre-training with Long Captions
Kecheng Zheng
Yifei Zhang
Wei Wu
Fan Lu
Shuailei Ma
Xin Jin
Wei Chen
Yujun Shen
VLMCLIP
121
29
0
25 Mar 2024
Open-Set Recognition in the Age of Vision-Language Models
Open-Set Recognition in the Age of Vision-Language Models
Dimity Miller
Niko Sünderhauf
Alex Kenna
Keita Mason
VLM
66
6
0
25 Mar 2024
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Yuchen Suo
Fan Ma
Linchao Zhu
Yi Yang
82
24
0
24 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video
  Understanding
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
88
79
0
22 Mar 2024
VidLA: Video-Language Alignment at Scale
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLMAI4TS
90
4
0
21 Mar 2024
Few-Shot Adversarial Prompt Learning on Vision-Language Models
Few-Shot Adversarial Prompt Learning on Vision-Language Models
Yiwei Zhou
Xiaobo Xia
Zhiwei Lin
Bo Han
Tongliang Liu
VLM
106
16
0
21 Mar 2024
MyVLM: Personalizing VLMs for User-Specific Queries
MyVLM: Personalizing VLMs for User-Specific Queries
Yuval Alaluf
Elad Richardson
Sergey Tulyakov
Kfir Aberman
Daniel Cohen-Or
MLLMVLM
107
23
0
21 Mar 2024
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship
  Detection
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Tim Salzmann
Markus Ryll
Alex Bewley
Matthias Minderer
94
4
0
21 Mar 2024
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
Jingjing Hu
Dan Guo
Kun Li
Zhan Si
Xun Yang
Xiaojun Chang
Meng Wang
132
3
0
21 Mar 2024
MTP: Advancing Remote Sensing Foundation Model via Multi-Task
  Pretraining
MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining
Di Wang
Jing Zhang
Minqiang Xu
Lin Liu
Dongsheng Wang
...
Chengxi Han
Haonan Guo
Bo Du
Dacheng Tao
Lefei Zhang
83
52
0
20 Mar 2024
Vid2Robot: End-to-end Video-conditioned Policy Learning with
  Cross-Attention Transformers
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Vidhi Jain
Maria Attarian
Nikhil J. Joshi
Ayzaan Wahid
Danny Driess
...
Stefan Welker
Christine Chan
Igor Gilitschenski
Yonatan Bisk
Debidatta Dwibedi
136
32
0
19 Mar 2024
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT
  Adaptation
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation
Wangbo Zhao
Jiasheng Tang
Yizeng Han
Yibing Song
Kai Wang
Gao Huang
F. Wang
Yang You
125
12
0
18 Mar 2024
EffiVED:Efficient Video Editing via Text-instruction Diffusion Models
EffiVED:Efficient Video Editing via Text-instruction Diffusion Models
Zhenghao Zhang
Zuozhuo Dai
Long Qin
Weizhi Wang
DiffMVGen
72
2
0
18 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Generative Region-Language Pretraining for Open-Ended Object Detection
Chuang Lin
Yi Jiang
Zhuang Li
Zehuan Yuan
Jianfei Cai
ObjDVLM
77
20
0
15 Mar 2024
RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training
RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training
Zhixiu Lu
Hailong Li
N. Parikh
Jonathan R. Dillman
Lili He
MedImVLM
129
1
0
15 Mar 2024
XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via
  Concept-guided Context Optimization
XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization
Yequan Bie
Luyang Luo
Zhixuan Chen
Hao Chen
76
7
0
14 Mar 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A
  Multi-Aspect Vision-Language Pre-training Framework
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
Vu Minh Hieu Phan
Yutong Xie
Yuankai Qi
Lingqiao Liu
Liyang Liu
Bowen Zhang
Zhibin Liao
Qi Wu
Minh-Son To
Johan Verjans
128
14
0
12 Mar 2024
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized
  Visual Class Discovery
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery
Haiyang Zheng
Nan Pu
Wenjing Li
N. Sebe
Zhun Zhong
91
7
0
12 Mar 2024
QUASAR: QUality and Aesthetics Scoring with Advanced Representations
QUASAR: QUality and Aesthetics Scoring with Advanced Representations
Sergey Kastryulin
Denis Prokopenko
Artem Babenko
Dmitry V. Dylov
61
0
0
11 Mar 2024
RESTORE: Towards Feature Shift for Vision-Language Prompt Learning
RESTORE: Towards Feature Shift for Vision-Language Prompt Learning
Yuncheng Yang
Chuyan Zhang
Zuopeng Yang
Yuting Gao
Yulei Qin
Ke Li
Xing Sun
Jie Yang
Yun Gu
VLMVPVLM
123
0
0
10 Mar 2024
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
Ibrahim Alabdulmohsin
Xiao Wang
Andreas Steiner
Priya Goyal
Alexander DÁmour
Xiao-Qi Zhai
84
21
0
07 Mar 2024
Modeling Collaborator: Enabling Subjective Vision Classification With
  Minimal Human Effort via LLM Tool-Use
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
Imad Eddine Toubal
Aditya Avinash
N. Alldrin
Jan Dlabal
Wenlei Zhou
...
Chun-Ta Lu
Howard Zhou
Ranjay Krishna
Ariel Fuxman
Tom Duerig
VLM
145
7
0
05 Mar 2024
HeAR -- Health Acoustic Representations
HeAR -- Health Acoustic Representations
Sebastien Baur
Zaid Nabulsi
Wei-Hung Weng
Jake Garrison
Louis Blankemeier
...
Shwetak N. Patel
S. Shetty
Shruthi Prabhakara
Monde Muyoyeta
Diego Ardila
LM&MA
55
14
0
04 Mar 2024
Differentially Private Representation Learning via Image Captioning
Differentially Private Representation Learning via Image Captioning
Tom Sander
Yaodong Yu
Maziar Sanjabi
Alain Durmus
Yi-An Ma
Kamalika Chaudhuri
Chuan Guo
95
4
0
04 Mar 2024
Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary
  Action Recognition
Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition
Kun-Yu Lin
Henghui Ding
Jiaming Zhou
Yu-Ming Tang
Yi-Xing Peng
Zhilin Zhao
Chen Change Loy
Wei-Shi Zheng
VLM
123
18
0
03 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
166
211
0
29 Feb 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the
  Open World
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
143
53
0
29 Feb 2024
SeD: Semantic-Aware Discriminator for Image Super-Resolution
SeD: Semantic-Aware Discriminator for Image Super-Resolution
Bingchen Li
Xin Li
Hanxin Zhu
Yeying Jin
Ruoyu Feng
Zhizheng Zhang
Zhibo Chen
SupR
101
23
0
29 Feb 2024
Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy,
  Advances, and Outlook
Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook
Xingchen Zou
Yibo Yan
Xixuan Hao
Yuehong Hu
Haomin Wen
...
Junbo Zhang
Yong Li
Tianrui Li
Yu Zheng
Yuxuan Liang
HAIAI4TS
104
45
0
29 Feb 2024
Generalizable Whole Slide Image Classification with Fine-Grained
  Visual-Semantic Interaction
Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction
Hao Li
Ying Chen
Yifei Chen
Wenxian Yang
Bowen Ding
Yuchen Han
Liansheng Wang
Rongshan Yu
105
19
0
29 Feb 2024
Sora: A Review on Background, Technology, Limitations, and Opportunities
  of Large Vision Models
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu
Kai Zhang
Yuan Li
Zhiling Yan
Chujie Gao
...
Yue Huang
Hanchi Sun
Jianfeng Gao
Lifang He
Lichao Sun
VLMVGenEGVM
191
300
0
27 Feb 2024
CLIPose: Category-Level Object Pose Estimation with Pre-trained
  Vision-Language Knowledge
CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language Knowledge
Xiao Lin
Minghao Zhu
Ronghao Dang
Guangliang Zhou
Shaolong Shu
Feng Lin
Chengju Liu
Qi Chen
CLIP
119
9
0
24 Feb 2024
User-LLM: Efficient LLM Contextualization with User Embeddings
User-LLM: Efficient LLM Contextualization with User Embeddings
Lin Ning
Luyang Liu
Jiaxing Wu
Neo Wu
D. Berlowitz
Sushant Prakash
Bradley Green
S. O’Banion
Jun Xie
99
38
0
21 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
123
36
0
20 Feb 2024
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong
  Vision-language Adapter
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Junfei Xiao
Zheng Xu
Alan Yuille
Shen Yan
Boyu Wang
39
3
0
16 Feb 2024
Previous
123...678...171819
Next