Advancing Medical Representation Learning Through High-Quality Data

18 March 2025
Negin Baghbanzadeh
Adibvafa Fallahpour
Yasaman Parhizkar
Franklin Ogidi
Shuvendu Roy
Sajad Ashkezari
Vahid Reza Khazaie
Michael Colacci
Ali Etemad
Arash Afkanpour
Elham Dolatabadi
Abstract

Despite the growing scale of medical vision-language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

@article{baghbanzadeh2025_2503.14377,
  title={Advancing Medical Representation Learning Through High-Quality Data},
  author={Negin Baghbanzadeh and Adibvafa Fallahpour and Yasaman Parhizkar and Franklin Ogidi and Shuvendu Roy and Sajad Ashkezari and Vahid Reza Khazaie and Michael Colacci and Ali Etemad and Arash Afkanpour and Elham Dolatabadi},
  journal={arXiv preprint arXiv:2503.14377},
  year={2025}
}