Data-Constrained Synthesis of Training Data for De-Identification

24 February 2025
Thomas Vakili
Aron Henriksson
Hercules Dalianis
Abstract

Many sensitive domains, such as the clinical domain, lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train new NER models. The results show that training NER models on synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process depends almost entirely on the performance of the machine-annotating NER models trained on the original data.
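
The pipeline described in the abstract (generate synthetic text, machine-annotate it with an NER model, then use the result as training data) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_generator` is a hypothetical stand-in for a domain-adapted LLM, `toy_tagger` is a hypothetical regex-based stand-in for the encoder-based NER model trained on the original data, and the labels `NAME` and `DATE` are assumed for the example.

```python
import random
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def toy_generator(n: int, seed: int = 0) -> List[str]:
    """Hypothetical stand-in for a domain-adapted LLM: samples template notes."""
    rng = random.Random(seed)
    names = ["Anna Berg", "Luis Romero"]
    dates = ["2021-03-14", "2020-11-02"]
    return [
        f"Patient {rng.choice(names)} was admitted on {rng.choice(dates)}."
        for _ in range(n)
    ]


def toy_tagger(text: str) -> List[Span]:
    """Hypothetical stand-in for the NER model trained on the original data."""
    spans: List[Span] = []
    # ISO-style dates such as 2021-03-14.
    for m in re.finditer(r"\d{4}-\d{2}-\d{2}", text):
        spans.append((m.start(), m.end(), "DATE"))
    # Two capitalized words directly after the word "Patient".
    for m in re.finditer(r"(?<=Patient )[A-Z][a-z]+ [A-Z][a-z]+", text):
        spans.append((m.start(), m.end(), "NAME"))
    return sorted(spans)


def build_synthetic_corpus(
    generate: Callable[[int], List[str]],
    tag: Callable[[str], List[Span]],
    n: int,
) -> List[Tuple[str, List[Span]]]:
    """Generate n synthetic texts and attach machine-produced annotations."""
    return [(text, tag(text)) for text in generate(n)]


corpus = build_synthetic_corpus(toy_generator, toy_tagger, 3)
for text, spans in corpus:
    print(text, [(text[s:e], label) for s, e, label in spans])
```

In this sketch, every annotation in the synthetic corpus comes from the tagger, so any error the tagger makes is carried into the data used to train the downstream model. That matches the abstract's observation that the outcome hinges on the quality of the machine-annotating NER model.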

@article{vakili2025_2502.14677,
  title={Data-Constrained Synthesis of Training Data for De-Identification},
  author={Thomas Vakili and Aron Henriksson and Hercules Dalianis},
  journal={arXiv preprint arXiv:2502.14677},
  year={2025}
}