2205.01917
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 935 papers shown
Title
Learning Object State Changes in Videos: An Open-World Perspective
Zihui Xue
Kumar Ashutosh
Kristen Grauman
VGen
114
21
0
19 Dec 2023
Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning
Bingchen Zhao
Haoqin Tu
Chen Wei
Jieru Mei
Cihang Xie
114
36
0
18 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
Gabriel Loaiza-Ganem
M. Volkovs
123
3
0
15 Dec 2023
General Object Foundation Model for Images and Videos at Scale
Junfeng Wu
Yi Jiang
Qihao Liu
Zehuan Yuan
Xiang Bai
Song Bai
VOS
VLM
111
41
0
14 Dec 2023
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning
Zhiyue Liu
Jinyuan Liu
Fanrong Ma
CLIP
VLM
78
12
0
14 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
57
4
0
14 Dec 2023
ViLA: Efficient Video-Language Alignment for Video Question Answering
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
101
15
0
13 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedIm
LM&MA
86
23
0
13 Dec 2023
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
Shuyang Sun
Runjia Li
Philip Torr
Xiuye Gu
Siyang Li
VLM
CLIP
140
34
0
12 Dec 2023
Domain Prompt Learning with Quaternion Networks
Qinglong Cao
Zhengqin Xu
Yuntian Chen
Chao Ma
Xiaokang Yang
VLM
123
12
0
12 Dec 2023
Honeybee: Locality-enhanced Projector for Multimodal LLM
Junbum Cha
Wooyoung Kang
Jonghwan Mun
Byungseok Roh
MLLM
97
133
0
11 Dec 2023
4M: Massively Multimodal Masked Modeling
David Mizrahi
Roman Bachmann
Oğuzhan Fatih Kar
Teresa Yeo
Mingfei Gao
Afshin Dehghan
Amir Zamir
MLLM
99
74
0
11 Dec 2023
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
Jiashuo Fan
Yaoyuan Liang
Leyao Liu
Shao-Lun Huang
Lei Zhang
112
2
0
11 Dec 2023
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models
Shitian Zhao
Zhuowan Li
Yadong Lu
Alan Yuille
Yan Wang
LRM
73
9
0
09 Dec 2023
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
Talfan Evans
Shreya Pathak
Hamza Merzic
Jonathan Schwarz
Ryutaro Tanno
Olivier J. Hénaff
80
17
0
08 Dec 2023
AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making
Shusen Liu
Haichao Miao
Zhimin Li
M. Olson
Valerio Pascucci
P. Bremer
105
11
0
07 Dec 2023
TokenCompose: Text-to-Image Diffusion with Token-level Supervision
Zirui Wang
Zhizhou Sha
Zheng Ding
Yilin Wang
Zhuowen Tu
DiffM
105
23
0
06 Dec 2023
Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey
Shengchao Chen
Guodong Long
Jing Jiang
Dikai Liu
Chengqi Zhang
SyDa
AI4CE
129
25
0
05 Dec 2023
Rejuvenating image-GPT as Strong Visual Representation Learners
Sucheng Ren
Zeyu Wang
Hongru Zhu
Junfei Xiao
Alan Yuille
Cihang Xie
VLM
116
8
0
04 Dec 2023
Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval
Dixuan Lin
Yi-Xing Peng
Jingke Meng
Wei-Shi Zheng
84
6
0
04 Dec 2023
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
Feng Wang
Jieru Mei
Alan Yuille
VLM
140
66
0
04 Dec 2023
PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren
Zhicheng Huang
Yunchao Wei
Yao-Min Zhao
Dongmei Fu
Jiashi Feng
Xiaojie Jin
VLM
MLLM
LRM
116
109
0
04 Dec 2023
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li
Jiawei Peng
Huiyi Chen
Chongyang Gao
Xu Yang
MLLM
108
22
0
04 Dec 2023
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Andrés Villa
Juan Carlos León Alcázar
Alvaro Soto
Bernard Ghanem
MLLM
VLM
85
11
0
03 Dec 2023
A Comprehensive Study of Vision Transformers in Image Classification Tasks
Mahmoud Khalil
Ahmad Khalil
A. Ngom
ViT
62
10
0
02 Dec 2023
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Walid Bousselham
Felix Petersen
Vittorio Ferrari
Hilde Kuehne
ObjD
VLM
121
49
0
01 Dec 2023
Segment and Caption Anything
Xiaoke Huang
Jianfeng Wang
Yansong Tang
Zheng Zhang
Han Hu
Jiwen Lu
Lijuan Wang
Zicheng Liu
MLLM
VLM
90
21
0
01 Dec 2023
Infrared Image Super-Resolution via GAN
Y. Huang
S. Omachi
GAN
76
0
0
01 Dec 2023
LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models
Ying Nie
Wei He
Kai Han
Yehui Tang
Tianyu Guo
Fanyi Du
Yunhe Wang
VLM
81
4
0
01 Dec 2023
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval
Taichi Nishimura
Shota Nakada
Masayoshi Kondo
VLM
81
1
0
01 Dec 2023
Green Edge AI: A Contemporary Survey
Yuyi Mao
X. Yu
Kaibin Huang
Ying-Jun Angela Zhang
Jun Zhang
127
21
0
01 Dec 2023
Brainformer: Mimic Human Visual Brain Functions to Machine Vision Models via fMRI
Xuan-Bac Nguyen
Xin Li
Pawan Sinha
Samee U. Khan
Khoa Luu
ViT
MedIm
94
0
0
30 Nov 2023
MLLMs-Augmented Visual-Language Representation Learning
Yanqing Liu
Kai Wang
Wenqi Shao
Ping Luo
Yu Qiao
Mike Zheng Shou
Kaipeng Zhang
Yang You
VLM
93
12
0
30 Nov 2023
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
Wujian Peng
Sicheng Xie
Zuyao You
Shiyi Lan
Zuxuan Wu
VLM
CoGe
MLLM
98
24
0
30 Nov 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
150
2
0
30 Nov 2023
GELDA: A generative language annotation framework to reveal visual biases in datasets
Krish Kabra
Kathleen M. Lewis
Guha Balakrishnan
VLM
44
1
0
29 Nov 2023
CLIPC8: Face liveness detection algorithm based on image-text pairs and contrastive learning
Xu Liu
Shu Zhou
Yurong Song
Wenzhe Luo
Xin Zhang
76
1
0
29 Nov 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
60
3
0
29 Nov 2023
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Raviteja Vemulapalli
Oncel Tuzel
CLIP
VLM
116
53
0
28 Nov 2023
The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation
Christel Chappuis
Eliot Walt
Vincent Mendez
Sylvain Lobry
B. L. Saux
D. Tuia
98
4
0
28 Nov 2023
Large Model Based Referring Camouflaged Object Detection
Shupeng Cheng
Ge-Peng Ji
Pengda Qin
Deng-Ping Fan
Bowen Zhou
Peng Xu
ObjD
62
8
0
28 Nov 2023
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
Jiayun Luo
Siddhesh Khandelwal
Leonid Sigal
Boyang Albert Li
MLLM
VLM
136
8
0
28 Nov 2023
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
Chenglin Yang
Siyuan Qiao
Yuan Cao
Yu Zhang
Tao Zhu
Alan Yuille
Jiahui Yu
VLM
54
3
0
27 Nov 2023
ViT-Lens: Towards Omni-modal Representations
Weixian Lei
Yixiao Ge
Kun Yi
Jianfeng Zhang
Difei Gao
Dylan Sun
Yuying Ge
Ying Shan
Mike Zheng Shou
97
20
0
27 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLM
ELM
VLM
352
960
0
27 Nov 2023
Efficient Pre-training for Localized Instruction Generation of Videos
Anil Batra
Davide Moltisanti
Laura Sevilla-Lara
Marcus Rohrbach
Frank Keller
87
0
0
27 Nov 2023
Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition
Yifei Chen
Dapeng Chen
Ruijin Liu
Sai Zhou
Wenyuan Xue
Wei Peng
57
6
0
27 Nov 2023
Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding
Hoang-Quan Nguyen
Thanh-Dat Truong
Xuan-Bac Nguyen
Ashley Dowling
Xin Li
Khoa Luu
VLM
79
20
0
26 Nov 2023
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann
Tim Dockhorn
Sumith Kulal
Daniel Mendelevitch
Maciej Kilian
...
Zion English
Vikram S. Voleti
Adam Letts
Varun Jampani
Robin Rombach
VGen
315
1,190
0
25 Nov 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
100
3
0
25 Nov 2023