ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2109.07765
16
20

Spanish Biomedical Crawled Corpus: A Large, Diverse Dataset for Spanish Biomedical Language Models

16 September 2021
C. Carrino
Jordi Armengol-Estapé
Ona de Gibert Bonet
Asier Gutiérrez-Fandiño
Aitor Gonzalez-Agirre
Martin Krallinger
Marta Villegas
ArXivPDFHTML
Abstract

We introduce CoWeSe (the Corpus Web Salud Espa\~nol), the largest Spanish biomedical corpus to date, consisting of 4.5GB (about 750M tokens) of clean plain text. CoWeSe is the result of a massive crawler on 3000 Spanish domains executed in 2020. The corpus is openly available and already preprocessed. CoWeSe is an important resource for biomedical and health NLP in Spanish and has already been employed to train domain-specific language models and to produce word embbedings. We released the CoWeSe corpus under a Creative Commons Attribution 4.0 International license, both in Zenodo (\url{https://zenodo.org/record/4561971\#.YTI5SnVKiEA}).

View on arXiv
Comments on this paper