arXiv:2304.14108

DataComp: In search of the next generation of multimodal datasets

27 April 2023
S. Gadre
Gabriel Ilharco
Alex Fang
J. Hayase
Georgios Smyrnis
Thao Nguyen
Ryan Marten
Mitchell Wortsman
Dhruba Ghosh
Jieyu Zhang
Eyal Orgad
R. Entezari
Giannis Daras
Sarah M Pratt
Vivek Ramanujan
Yonatan Bitton
Kalyani Marathe
Stephen Mussmann
Richard Vencu
Mehdi Cherti
Ranjay Krishna
Pang Wei Koh
O. Saukh
Alexander Ratner
Shuran Song
Hannaneh Hajishirzi
Ali Farhadi
Romain Beaumont
Sewoong Oh
A. Dimakis
J. Jitsev
Y. Carmon
Vaishaal Shankar
Ludwig Schmidt
Abstract

Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
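
The workflow the abstract describes (filter a candidate pool, train with fixed code, then evaluate on a set of downstream tasks) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in and not part of the DataComp codebase: the `Pair` record layout, `filter_pool`, the toy `caption_length_score` heuristic, and `average_accuracy`. The standardized CLIP training step is deliberately left out.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record layout: (image URL, caption text).
Pair = Tuple[str, str]


def filter_pool(
    pool: List[Pair],
    score_fn: Callable[[Pair], float],
    keep_fraction: float,
) -> List[Pair]:
    """Keep the highest-scoring fraction of a candidate pool."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank pairs from highest to lowest score.
    ranked = sorted(pool, key=score_fn, reverse=True)
    # Keep at least one pair when the pool is non-empty.
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def caption_length_score(pair: Pair) -> float:
    """Toy heuristic (an assumption): score a pair by caption word count."""
    return float(len(pair[1].split()))


def average_accuracy(per_task: Dict[str, float]) -> float:
    """Unweighted mean of per-task accuracies for a trained model."""
    if not per_task:
        raise ValueError("per_task must not be empty")
    return sum(per_task.values()) / len(per_task)


if __name__ == "__main__":
    pool = [
        ("https://example.com/1.jpg", "a dog"),
        ("https://example.com/2.jpg", "a small brown dog running on grass"),
        ("https://example.com/3.jpg", "img"),
        ("https://example.com/4.jpg", "photo of a cat"),
    ]
    subset = filter_pool(pool, caption_length_score, keep_fraction=0.5)
    print([caption for _, caption in subset])
```

In a real submission the scoring function would be the participant's own filtering technique, and the per-task accuracies would come from evaluating the model trained on the filtered subset.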
