The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages such as Persian. Although numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study explores enhancing the medical knowledge of a small language model by leveraging accessible online data, including a corpus crawled from medical magazines and a dataset of real doctor-patient QA pairs. We fine-tuned a baseline model on our curated data to improve its medical knowledge. Benchmark evaluations show that the fine-tuned model achieves higher accuracy in medical question answering and produces better responses than its baseline. This work highlights the potential of open-access online data to enrich small language models in the medical domain, offering a practical solution for Persian medical AI applications in resource-constrained environments.
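The abstract does not give implementation details, but the described pipeline (supervised fine-tuning of a small causal language model on curated Persian medical QA pairs) could look roughly like the sketch below, written with Hugging Face Transformers. The base-model identifier, prompt template, and toy QA examples are illustrative assumptions, not the authors' actual choices.

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder identifier: the abstract does not name the baseline model.
BASE_MODEL = "org/small-persian-causal-lm"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token

# Toy stand-ins for the curated doctor-patient QA pairs
# ("What are the symptoms of anemia?" / "Fatigue, pallor, and dizziness are common symptoms.").
qa_pairs = [
    {"question": "علائم کم‌خونی چیست؟",
     "answer": "خستگی، رنگ‌پریدگی و سرگیجه از علائم شایع هستند."},
]

def format_example(example):
    # Assumed prompt template; the paper's actual template is not given in the abstract.
    return {"text": f"پرسش: {example['question']}\nپاسخ: {example['answer']}{tokenizer.eos_token}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = (
    Dataset.from_list(qa_pairs)
    .map(format_example)
    .map(tokenize, remove_columns=["question", "answer", "text"])
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="persian-medical-sft",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice the crawled magazine corpus would be used for continued pre-training on raw text, while the QA pairs drive instruction-style fine-tuning as shown; the hyperparameters above are placeholders rather than values reported by the paper.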
@article{ghassabi2025_2505.16000,
  title={Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model},
  author={Mehrdad Ghassabi and Pedram Rostami and Hamidreza Baradaran Kashani and Amirhossein Poursina and Zahra Kazemi and Milad Tavakoli},
  journal={arXiv preprint arXiv:2505.16000},
  year={2025}
}