2205.01917
Cited By
v1
v2 (latest)
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 935 papers shown
Title
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM
VLM
16
0
0
20 Jun 2025
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria
Adinath Madhavrao Dukre
Feilong Tang
Sara Atito
Sudipta Roy
Muhammad Awais
Muhammad Haris Khan
Imran Razzak
VLM
40
0
0
18 Jun 2025
Interpretable Text-Guided Image Clustering via Iterative Search
Bingchen Zhao
Oisin Mac Aodha
33
0
0
14 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
30
0
0
13 Jun 2025
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling
Bill Psomas
Dionysis Christopoulos
Eirini Baltzi
Ioannis Kakogeorgiou
Tilemachos Aravanis
N. Komodakis
Konstantinos Karantzalos
Yannis Avrithis
Giorgos Tolias
56
0
0
11 Jun 2025
Canonical Latent Representations in Conditional Diffusion Models
Yitao Xu
Tong Zhang
Ehsan Pajouheshgar
Sabine Süsstrunk
DiffM
77
0
0
11 Jun 2025
Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach
Youqi Wu
Jingwei Zhang
Farzan Farnia
23
0
0
10 Jun 2025
SensorLM: Learning the Language of Wearable Sensors
Yuwei Zhang
Kumar Ayush
Siyuan Qiao
A. Heydari
Girish Narayanswamy
...
Shwetak N. Patel
Cecilia Mascolo
Xin Liu
Daniel J. McDuff
Yuzhe Yang
51
0
0
10 Jun 2025
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning
Tianyi Bai
Yuxuan Fan
Jiantao Qiu
Fupeng Sun
Jiayi Song
Junlin Han
Zichen Liu
Conghui He
Wentao Zhang
Binhang Yuan
MLLM
VLM
23
0
0
08 Jun 2025
FREE: Fast and Robust Vision Language Models with Early Exits
Divya J. Bajpai
M. Hanawal
VLM
15
0
0
07 Jun 2025
BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance
Huy Le
Nhat Chung
Tung Kieu
A. Nguyen
Ngan Le
70
0
0
04 Jun 2025
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg
Naman D. Singh
Matthias Hein
CoGe
VLM
32
0
0
30 May 2025
A Mathematical Perspective On Contrastive Learning
Ricardo Baptista
Andrew Stuart
S. D. Tran
20
0
0
30 May 2025
From Theory to Application: Fine-Tuning Large EEG Model with Real-World Stress Data
Siwen Wang
Shitou Zhang
Wan-Lin Chen
Dung Truong
Tzyy-Ping Jung
36
0
0
29 May 2025
Revisiting Bayesian Model Averaging in the Era of Foundation Models
Mijung Park
UQCV
MoMe
17
0
0
28 May 2025
Vision Transformers with Self-Distilled Registers
Yinjie Chen
Zipeng Yan
Chong Zhou
Bo Dai
Andrew F. Luo
54
0
0
27 May 2025
Visualized Text-to-Image Retrieval
Di Wu
Yixin Wan
Kai-Wei Chang
47
1
0
26 May 2025
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni
Zhengyuan Yang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
W. Zuo
Lijuan Wang
ReLM
LRM
85
1
0
26 May 2025
Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models
Mobina Mansoori
Sajjad Shahabodini
Farnoush Bayatmakou
J. Abouei
Konstantinos N. Plataniotis
Arash Mohammadi
41
0
0
26 May 2025
Progressive Scaling Visual Object Tracking
Jack Hong
Shilin Yan
Zehao Xiao
Jiayin Cai
Xiaolong Jiang
Yao Hu
Henghui Ding
77
0
0
26 May 2025
PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology
Jiabo Ma
Yingxue Xu
Fengtao Zhou
Y. X. R. Wang
Cheng Jin
...
Xiuming Zhang
Li Liang
R. Chan
Zhe Wang
H. Chen
LM&MA
VLM
49
0
0
26 May 2025
Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
Daniel Csizmadia
Andrei Codreanu
Victor Sim
Vighnesh Prabhu
Michael Lu
Kevin Zhu
Sean O'Brien
Vasu Sharma
CLIP
VLM
71
0
0
25 May 2025
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
Y. Chen
Wenjie Xiao
P. R. Bassi
Xinze Zhou
Sezgin Er
Ibrahim Ethem Hamamci
Zongwei Zhou
Alan Yuille
ELM
57
0
0
25 May 2025
SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data
Dong-Hee Kim
Hyunjee Song
Donghyun Kim
290
0
0
23 May 2025
Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts
Taewon Kang
Ming C. Lin
DiffM
VGen
83
0
0
22 May 2025
Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text
Kun-Yu Lin
Hongjun Wang
Weining Ren
Kai Han
291
0
0
22 May 2025
Aligning Explanations with Human Communication
Jacopo Teneggi
Zhenzhen Wang
Paul H. Yi
Tianmin Shu
Jeremias Sulam
173
0
0
21 May 2025
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
Lihong Chen
Hossein Hassani
Soodeh Nikan
VLM
104
0
0
19 May 2025
Enhancing LLMs for Time Series Forecasting via Structure-Guided Cross-Modal Alignment
Siming Sun
Kai Zhang
Xuejun Jiang
Wenchao Meng
Qinmin Yang
AI4TS
58
0
0
19 May 2025
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Subash Khanal
Srikumar Sastry
Aayush Dhakal
Adeel Ahmad
Nathan Jacobs
76
0
0
19 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Shibin Mei
Hang Wang
Bingbing Ni
74
0
0
16 May 2025
Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning
Sriram Mandalika
VLM
61
0
0
16 May 2025
MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment
Siyuan Yan
Xiaochen Li
Ming Hu
Yiwen Jiang
Zhen Yu
Zongyuan Ge
MedIm
VLM
79
0
0
14 May 2025
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization
Seongjae Kang
Dong Bok Lee
Hyungjoon Jang
Sung Ju Hwang
VLM
101
0
0
12 May 2025
Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning
H. M. D. Kabir
S. Mondal
Mohammad Ali Moni
35
0
0
10 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
155
1
0
08 May 2025
ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning
Enhao Zhang
Chaohua Li
Chuanxing Geng
Songcan Chen
161
0
0
08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Yixiao Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
440
2
0
07 May 2025
HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction
Muhammad Haris Khan
Miguel Altamirano Cabrera
Dmitrii Iarchuk
Yara Mahmoud
Daria Trinitatova
Issatay Tokmurziyev
Dzmitry Tsetserukou
VLM
84
0
0
05 May 2025
Mitigating Group-Level Fairness Disparities in Federated Visual Language Models
Chaomeng Chen
Zitong Yu
Jin Song Dong
Sen Su
Linlin Shen
Shutao Xia
Xiaochun Cao
FedML
VLM
455
0
0
03 May 2025
Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts
Wenfa Wu
Guanyu Zhang
Zheng Tan
Yi Wang
Hongsheng Qi
AI4TS
106
2
0
02 May 2025
Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling
Hyun Lee
Chris Yi
Maminur Islam
B.D.S. Aritra
72
0
0
02 May 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Wufei Ma
Luoxin Ye
Nessa McWeeney
Celso M de Melo
Jieneng Chen
LRM
118
1
0
01 May 2025
Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design
Vasudev Sharma
Ahmed Alagha
Abdelhakim Khellaf
Vincent Quoc-Huy Trinh
Mahdi S. Hosseini
143
0
0
30 Apr 2025
Bayesian Principles Improve Prompt Learning In Vision-Language Models
Mingyu Kim
Jongwoo Ko
Mijung Park
VLM
119
0
0
19 Apr 2025
Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization
Hongwei Ji
Wulian Yun
Mengshi Qi
Huadong Ma
LRM
445
0
0
18 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
329
9
0
17 Apr 2025
AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification
Md. Sanaullah Chowdhury
Lameya Sabrin
VLM
59
0
0
17 Apr 2025
Can Masked Autoencoders Also Listen to Birds?
Lukas Rauch
Ilyass Moummad
René Heinrich
Alexis Joly
Bernhard Sick
Christoph Scholz
151
0
0
17 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei
Jiacong Wang
Haochen Wang
Xuelong Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
74
5
0
14 Apr 2025