TULIP: Towards Unified Language-Image Pretraining

19 March 2025
Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan
Topics: VLM, CLIP, MLLM
Abstract

Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. By prioritizing language alignment, these models tend to emphasize high-level semantics over fine-grained visual detail, weakening their image understanding. Conversely, vision-focused models process visual information well but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks: it establishes new SOTA zero-shot performance on ImageNet-1K, delivers up to a 2× improvement over SigLIP on RxRx1 in linear probing for few-shot classification, and improves vision-language models, achieving over 3× higher scores than SigLIP on MMVP. Our code and checkpoints are available at this https URL.
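The abstract lists three training signals beyond standard image-text alignment: image-image and text-text contrastive terms over generatively augmented views, and an image/text reconstruction regularizer. The sketch below is a minimal illustration of how such a combined objective could be assembled in PyTorch; it is not the authors' implementation. The function names, loss weights (lambda_ii, lambda_tt, lambda_rec), and the use of a symmetric InfoNCE loss for the contrastive terms (SigLIP itself uses a sigmoid-based loss) are all illustrative assumptions.

```python
# Hypothetical sketch of a TULIP-style combined objective (not the official code).
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings of shape (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def combined_loss(img_emb, txt_emb, img_aug_emb, txt_para_emb,
                  img_recon, img_target, txt_logits, txt_target,
                  lambda_ii=0.5, lambda_tt=0.5, lambda_rec=0.1):
    # 1) image-text alignment (CLIP-style contrastive term)
    l_it = info_nce(img_emb, txt_emb)
    # 2) image-image contrast against generatively augmented image views
    l_ii = info_nce(img_emb, img_aug_emb)
    # 3) text-text contrast against paraphrased / regenerated captions
    l_tt = info_nce(txt_emb, txt_para_emb)
    # 4) reconstruction regularization: pixel MSE plus caption cross-entropy,
    #    where txt_logits has shape (N, L, V) and txt_target has shape (N, L)
    l_rec = F.mse_loss(img_recon, img_target) + \
            F.cross_entropy(txt_logits.flatten(0, 1), txt_target.flatten())
    return l_it + lambda_ii * l_ii + lambda_tt * l_tt + lambda_rec * l_rec
```

The relative weights of the auxiliary terms are placeholders; in practice they would be tuned so that the fine-grained visual objectives do not overwhelm the global image-text alignment the abstract says is preserved.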

@article{tang2025_2503.15485,
  title={TULIP: Towards Unified Language-Image Pretraining},
  author={Zineng Tang and Long Lian and Seun Eisape and XuDong Wang and Roei Herzig and Adam Yala and Alane Suhr and Trevor Darrell and David M. Chan},
  journal={arXiv preprint arXiv:2503.15485},
  year={2025}
}