CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022

Mojtaba Seyedhosseini

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 935 papers shown

Title
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation Tongtian Yue Longteng Guo Yepeng Tang Zijia Zhao Xinxin Zhu Hua Huang Jing Liu MLLM VLM 16 0 0 20 Jun 2025
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning Ankan Deria Adinath Madhavrao Dukre Feilong Tang Sara Atito Sudipta Roy Muhammad Awais Muhammad Haris Khan Imran Razzak VLM 40 0 0 18 Jun 2025
Interpretable Text-Guided Image Clustering via Iterative Search Bingchen Zhao Oisin Mac Aodha 33 0 0 14 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs Xiao Xu L. Qin Wanxiang Che Min-Yen Kan MoE VLM 30 0 0 13 Jun 2025
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling Bill Psomas Dionysis Christopoulos Eirini Baltzi Ioannis Kakogeorgiou Tilemachos Aravanis N. Komodakis Konstantinos Karantzalos Yannis Avrithis Giorgos Tolias 56 0 0 11 Jun 2025
Canonical Latent Representations in Conditional Diffusion Models Yitao Xu Tong Zhang Ehsan Pajouheshgar Sabine Süsstrunk DiffM 77 0 0 11 Jun 2025
Fusing Cross-modal and Uni-modal Representations: A Kronecker Product Approach Youqi Wu Jingwei Zhang Farzan Farnia 23 0 0 10 Jun 2025
SensorLM: Learning the Language of Wearable Sensors Yuwei Zhang Kumar Ayush Siyuan Qiao A. Heydari Girish Narayanswamy ... Shwetak N. Patel Cecilia Mascolo Xin Liu Daniel J. McDuff Yuzhe Yang 51 0 0 10 Jun 2025
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning Tianyi Bai Yuxuan Fan Jiantao Qiu Fupeng Sun Jiayi Song Junlin Han Zichen Liu Conghui He Wentao Zhang Binhang Yuan MLLM VLM 23 0 0 08 Jun 2025
FREE: Fast and Robust Vision Language Models with Early Exits Divya J. Bajpai M. Hanawal VLM 15 0 0 07 Jun 2025
BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance Huy Le Nhat Chung Tung Kieu A. Nguyen Ngan Le 70 0 0 04 Jun 2025
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning Amit Peleg Naman D. Singh Matthias Hein CoGe VLM 32 0 0 30 May 2025
A Mathematical Perspective On Contrastive Learning Ricardo Baptista Andrew Stuart S. D. Tran 20 0 0 30 May 2025
From Theory to Application: Fine-Tuning Large EEG Model with Real-World Stress Data Siwen Wang Shitou Zhang Wan-Lin Chen Dung Truong Tzyy-Ping Jung 36 0 0 29 May 2025
Revisiting Bayesian Model Averaging in the Era of Foundation Models Mijung Park UQCV MoMe 17 0 0 28 May 2025
Vision Transformers with Self-Distilled Registers Yinjie Chen Zipeng Yan Chong Zhou Bo Dai Andrew F. Luo 54 0 0 27 May 2025
Visualized Text-to-Image Retrieval Di Wu Yixin Wan Kai-Wei Chang 47 1 0 26 May 2025
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning Minheng Ni Zhengyuan Yang Linjie Li Chung-Ching Lin Kevin Qinghong Lin W. Zuo Lijuan Wang ReLM LRM 85 1 0 26 May 2025
Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models Mobina Mansoori Sajjad Shahabodini Farnoush Bayatmakou J. Abouei Konstantinos N. Plataniotis Arash Mohammadi 41 0 0 26 May 2025
Progressive Scaling Visual Object Tracking Jack Hong Shilin Yan Zehao Xiao Jiayin Cai Xiaolong Jiang Yao Hu Henghui Ding 77 0 0 26 May 2025
PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology Jiabo Ma Yingxue Xu Fengtao Zhou Y. X. R. Wang Cheng Jin ... Xiuming Zhang Li Liang R. Chan Zhe Wang H. Chen LM&MA VLM 49 0 0 26 May 2025
Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation Daniel Csizmadia Andrei Codreanu Victor Sim Vighnesh Prabhu Michael Lu Kevin Zhu Sean O'Brien Vasu Sharma CLIP VLM 71 0 0 25 May 2025
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering Y. Chen Wenjie Xiao P. R. Bassi Xinze Zhou Sezgin Er Ibrahim Ethem Hamamci Zongwei Zhou Alan Yuille ELM 57 0 0 25 May 2025
SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data Dong-Hee Kim Hyunjee Song Donghyun Kim 290 0 0 23 May 2025
Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts Taewon Kang Ming C. Lin DiffM VGen 83 0 0 22 May 2025
Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text Kun-Yu Lin Hongjun Wang Weining Ren Kai Han 291 0 0 22 May 2025
Aligning Explanations with Human Communication Jacopo Teneggi Zhenzhen Wang Paul H. Yi Tianmin Shu Jeremias Sulam 173 0 0 21 May 2025
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning Lihong Chen Hossein Hassani Soodeh Nikan VLM 104 0 0 19 May 2025
Enhancing LLMs for Time Series Forecasting via Structure-Guided Cross-Modal Alignment Siming Sun Kai Zhang Xuejun Jiang Wenchao Meng Qinmin Yang AI4TS 58 0 0 19 May 2025
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping Subash Khanal Srikumar Sastry Aayush Dhakal Adeel Ahmad Nathan Jacobs 76 0 0 19 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning Shibin Mei Hang Wang Bingbing Ni 74 0 0 16 May 2025
Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning Sriram Mandalika VLM 61 0 0 16 May 2025
MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment Siyuan Yan Xiaochen Li Ming Hu Yiwen Jiang Zhen Yu Zongyuan Ge MedIm VLM 79 0 0 14 May 2025
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization Seongjae Kang Dong Bok Lee Hyungjoon Jang Sung Ju Hwang VLM 101 0 0 12 May 2025
Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning H. M. D. Kabir S. Mondal Mohammad Ali Moni 35 0 0 10 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP Hanxun Huang Sarah Monazam Erfani Yige Li Xingjun Ma James Bailey AAML 155 1 0 08 May 2025
ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning Enhao Zhang Chaohua Li Chuanxing Geng Songcan Chen 161 0 0 08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning Xianhang Li Yixiao Liu Haoqin Tu Hongru Zhu Cihang Xie VLM 440 2 0 07 May 2025
HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction Muhammad Haris Khan Miguel Altamirano Cabrera Dmitrii Iarchuk Yara Mahmoud Daria Trinitatova Issatay Tokmurziyev Dzmitry Tsetserukou VLM 84 0 0 05 May 2025
Mitigating Group-Level Fairness Disparities in Federated Visual Language Models Chaomeng Chen Zitong Yu Jin Song Dong Sen Su Linlin Shen Shutao Xia Xiaochun Cao FedML VLM 455 0 0 03 May 2025
Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts Wenfa Wu Guanyu Zhang Zheng Tan Yi Wang Hongsheng Qi AI4TS 106 2 0 02 May 2025
Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling Hyun Lee Chris Yi Maminur Islam B.D.S. Aritra 72 0 0 02 May 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models Wufei Ma Luoxin Ye Nessa McWeeney Celso M de Melo Jieneng Chen LRM 118 1 0 01 May 2025
Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design Vasudev Sharma Ahmed Alagha Abdelhakim Khellaf Vincent Quoc-Huy Trinh Mahdi S. Hosseini 143 0 0 30 Apr 2025
Bayesian Principles Improve Prompt Learning In Vision-Language Models Mingyu Kim Jongwoo Ko Mijung Park VLM 119 0 0 19 Apr 2025
Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization Hongwei Ji Wulian Yun Mengshi Qi Huadong Ma LRM 445 0 0 18 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network Daniel Bolya Po-Yao (Bernie) Huang Peize Sun Jang Hyun Cho Andrea Madotto ... Shiyu Dong Nikhila Ravi Daniel Li Piotr Dollár Christoph Feichtenhofer ObjD VOS 329 9 0 17 Apr 2025
AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification Md. Sanaullah Chowdhury Lameya Sabrin VLM 59 0 0 17 Apr 2025
Can Masked Autoencoders Also Listen to Birds? Lukas Rauch Ilyass Moummad René Heinrich Alexis Joly Bernhard Sick Christoph Scholz 151 0 0 17 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer Weixian Lei Jiacong Wang Haochen Wang Xuelong Li Jun Hao Liew Jiashi Feng Zilong Huang 74 5 0 14 Apr 2025