8
0

Natural Language Processing tools for Pharmaceutical Manufacturing Information Extraction from Patents Natural Language Processing (NLP) tools for Pharmaceutical Manufacturing Information Extraction from Patents

Diego Alvarado-Maldonado
Blair Johnston
Cameron J. Brown
Abstract

Abundant and diverse data on medicines manufacturing and other lifecycle components has been made easily accessible in the last decades. However, a significant proportion of this information is characterised by not being tabulated and usable for machine learning purposes. Thus, natural language processing tools have been used to build databases in domains such as biomedical and chemical to address this limitation. This has allowed the development of artificial intelligence applications, which have improved drug discovery and treatments. In the pharmaceutical manufacturing context, some initiatives and datasets for primary processing can be found, but the manufacturing of drug products is an area which is still lacking, to the best of our knowledge. This works aims to explore and adapt NLP tools used in other domains to extract information on both primary and secondary manufacturing, employing patents as the main source of data. Thus, two independent, but complementary, models were developed comprising a method to select fragments of text that contain manufacturing data, and a named entity recognition system that enables extracting information on operations, materials, and conditions of a process. For the first model, the identification of relevant sections was achieved using an unsupervised approach combining Latent Dirichlet Allocation and k-Means clustering. The performance of this model measured as a Cohen's kappa between model output and manual revision was higher than 90%. NER model consisted of a deep neural network, and an f1-score micro average of 84.2% was obtained which is comparable to other works. Some considerations for these tools to be used in data extraction are discussed throughout this document.

View on arXiv
@article{alvarado-maldonado2025_2504.20598,
  title={ Natural Language Processing tools for Pharmaceutical Manufacturing Information Extraction from Patents Natural Language Processing (NLP) tools for Pharmaceutical Manufacturing Information Extraction from Patents },
  author={ Diego Alvarado-Maldonado and Blair Johnston and Cameron J. Brown },
  journal={arXiv preprint arXiv:2504.20598},
  year={ 2025 }
}
Comments on this paper