Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2205.01917
Cited By
v1
v2 (latest)
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 935 papers shown
Title
Gloss-Free End-to-End Sign Language Translation
Kezhou Lin
Xiaohan Wang
Linchao Zhu
Ke Sun
Bang Zhang
Yezhou Yang
SLR
70
20
0
22 May 2023
Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach
Haoning Wu
Erli Zhang
Liang Liao
Chaofeng Chen
Jingwen Hou
Annan Wang
Wenxiu Sun
Qiong Yan
Weisi Lin
85
40
0
22 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
63
2
0
21 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Qingbin Liu
58
1
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
151
122
0
18 May 2023
MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts
Qiuhui Chen
Xinyue Hu
Zirui Wang
Yi Hong
LM&MA
MedIm
63
40
0
18 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
144
85
0
17 May 2023
Improved baselines for vision-language pre-training
Enrico Fini
Pietro Astolfi
Adriana Romero Soriano
Jakob Verbeek
M. Drozdzal
SSL
CLIP
VLM
128
23
0
15 May 2023
OneCAD: One Classifier for All image Datasets using multimodal learning
S. Wadekar
Eugenio Culurciello
108
0
0
11 May 2023
Simple Token-Level Confidence Improves Caption Correctness
Suzanne Petryk
Spencer Whitehead
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
90
7
0
11 May 2023
An Inverse Scaling Law for CLIP Training
Xianhang Li
Zeyu Wang
Cihang Xie
VLM
CLIP
115
58
0
11 May 2023
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD
ViT
VLM
84
80
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
147
142
0
11 May 2023
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Hassan Akbari
Dan Kondratyuk
Huayu Chen
Rachel Hornung
Haoran Wang
Hartwig Adam
VLM
MoE
105
13
0
10 May 2023
Visual Tuning
Bruce X. B. Yu
Jianlong Chang
Haixin Wang
Lin Liu
Shijie Wang
...
Lingxi Xie
Haojie Li
Zhouchen Lin
Qi Tian
Chang Wen Chen
VLM
171
41
0
10 May 2023
ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
195
943
0
09 May 2023
Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness
Liangliang Cao
Bowen Zhang
Chen Chen
Yinfei Yang
Xianzhi Du
Wen‐Cheng Zhang
Zhiyun Lu
Yantao Zheng
CLIP
VLM
75
15
0
08 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
125
25
0
06 May 2023
TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
Mathis Petrovich
Michael J. Black
Gül Varol
VGen
125
85
0
02 May 2023
Multimodal Neural Databases
Giovanni Trappolini
Andrea Santilli
Emanuele Rodolà
A. Halevy
Fabrizio Silvestri
101
10
0
02 May 2023
What Do Self-Supervised Vision Transformers Learn?
Namuk Park
Wonjae Kim
Byeongho Heo
Taekyung Kim
Sangdoo Yun
SSL
176
80
1
01 May 2023
Adversarial Representation Learning for Robust Privacy Preservation in Audio
Shayan Gharib
Minh Tran
Diep Luong
Konstantinos Drossos
Tuomas Virtanen
AAML
43
5
0
29 Apr 2023
An Empirical Study of Multimodal Model Merging
Yi-Lin Sung
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Joey Tianyi Zhou
Lijuan Wang
MoMe
115
42
0
28 Apr 2023
Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment
Haoning Wu
Liang Liao
Annan Wang
Chaofeng Chen
Jingwen Hou
Wenxiu Sun
Qiong Yan
Weisi Lin
102
15
0
28 Apr 2023
Retrieval-based Knowledge Augmented Vision Language Pre-training
Jiahua Rao
Zifei Shan
Long Liu
Yao Zhou
Yuedong Yang
VLM
163
14
0
27 Apr 2023
RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
Seulki Park
Daeho Um
Hajung Yoon
Sanghyuk Chun
Sangdoo Yun
Hawook Jeong
93
3
0
21 Apr 2023
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Yihao Chen
Xianbiao Qi
Jianan Wang
Lei Zhang
82
18
0
17 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jinhui Tang
VLM
136
112
0
17 Apr 2023
Permutation Equivariance of Transformers and Its Applications
Hengyuan Xu
Liyao Xiang
Hang Ye
Dixi Yao
Pengzhi Chu
Baochun Li
54
15
0
16 Apr 2023
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Jie Guo
Qimeng Wang
Yan Gao
Xiaolong Jiang
Xu Tang
Yao Hu
Baochang Zhang
VLM
77
11
0
14 Apr 2023
Efficient Multimodal Fusion via Interactive Prompting
Yaowei Li
Ruijie Quan
Linchao Zhu
Yezhou Yang
82
45
0
13 Apr 2023
RECLIP: Resource-efficient CLIP by Training with Small Images
Runze Li
Dahun Kim
B. Bhanu
Weicheng Kuo
VLM
CLIP
76
13
0
12 Apr 2023
Gradient-Free Textual Inversion
Zhengcong Fei
Mingyuan Fan
Junshi Huang
DiffM
114
33
0
12 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
59
4
0
11 Apr 2023
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
VLM
3DV
83
26
0
11 Apr 2023
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
Jing Liu
116
7
0
09 Apr 2023
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
68
14
0
06 Apr 2023
VicTR: Video-conditioned Text Representations for Activity Recognition
Kumara Kahatapitiya
Anurag Arnab
Arsha Nagrani
Michael S. Ryoo
105
22
0
05 Apr 2023
Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks
Michael Weiss
Paolo Tonella
AI4CE
54
0
0
05 Apr 2023
ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules
Zhi-Qi Cheng
Qianwen Dai
Siyao Li
Jingdong Sun
Teruko Mitamura
Alexander G. Hauptmann
76
22
0
05 Apr 2023
Uncertainty estimation in Deep Learning for Panoptic segmentation
Michael J. Smith
F. Ferrie
OOD
UQCV
65
0
0
04 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM
AI4TS
94
6
0
04 Apr 2023
Black Box Few-Shot Adaptation for Vision-Language models
Yassine Ouali
Adrian Bulat
Brais Martínez
Georgios Tzimiropoulos
VLM
87
35
0
04 Apr 2023
Exploring Vision-Language Models for Imbalanced Learning
Yidong Wang
Zhuohao Yu
Jindong Wang
Qiang Heng
Haoxing Chen
Wei Ye
Rui Xie
Xingxu Xie
Shi-Bo Zhang
VLM
105
33
0
04 Apr 2023
Vision-Language Models for Vision Tasks: A Survey
Jingyi Zhang
Jiaxing Huang
Sheng Jin
Shijian Lu
VLM
165
551
0
03 Apr 2023
From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding
Yong-Lu Li
Xiaoqian Wu
Xinpeng Liu
Zehao Wang
Yiming Dou
...
Junyi Zhang
Yixing Li
Jingru Tan
Xudong Lu
Cewu Lu
169
17
0
02 Apr 2023
SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger
Yuting Gao
Jinfeng Liu
Zi-Han Xu
Tong Wu
Wen Liu
Jie Yang
Keren Li
Xingen Sun
CLIP
VLM
64
47
0
30 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Tianlin Li
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
130
9
0
30 Mar 2023
AutoAD: Movie Description in Context
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
77
35
0
29 Mar 2023
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
Kun Su
Kaizhi Qian
Eli Shlizerman
Antonio Torralba
Chuang Gan
VGen
AI4CE
87
20
0
29 Mar 2023
Previous
1
2
3
...
13
14
15
...
17
18
19
Next