PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

5 June 2024
Charlie Hou, Akshat Shrivastava, Hongyuan Zhan, Rylan Conway, Trang Le, Adithya Sagar, Giulia Fanti, Daniel Lazar
Abstract

On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes (ε=1.29, ε=7.58). We achieve these results while using 9× fewer rounds, 6× less client computation per round, and 100× less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.
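The abstract's proposal is to replace on-device training with server-side training on DP synthetic text. As a rough illustration of that workflow only, the sketch below implements a simplified, non-federated Private-Evolution-style loop: candidate synthetic texts are scored by a noise-protected nearest-neighbor vote from the private data, the favored candidates are paraphrased by an LLM, and the final pool serves as the training corpus. The helpers `embed` and `generate_variations`, and all parameter values, are hypothetical placeholders, not the paper's implementation; see the linked repository for the actual method.

```python
import numpy as np

def dp_nearest_neighbor_histogram(private_embs, synthetic_embs, sigma):
    # Each private sample votes for its closest synthetic sample; Gaussian
    # noise on the vote histogram provides the differential privacy guarantee
    # (each private sample changes exactly one bin by at most 1).
    votes = np.zeros(len(synthetic_embs))
    for e in private_embs:
        dists = np.linalg.norm(synthetic_embs - e, axis=1)
        votes[int(np.argmin(dists))] += 1.0
    return votes + np.random.normal(0.0, sigma, size=votes.shape)

def dp_synthetic_text_sketch(private_texts, embed, generate_variations,
                             n_synthetic=256, rounds=10, sigma=5.0):
    # embed: list[str] -> (n, d) np.ndarray.
    # generate_variations: list[str] -> list[str], one paraphrase per input.
    # Both are hypothetical stand-ins for an embedding model and an LLM.
    synthetic = generate_variations(["<seed prompt>"] * n_synthetic)
    private_embs = embed(private_texts)
    for _ in range(rounds):
        noisy_votes = dp_nearest_neighbor_histogram(
            private_embs, embed(synthetic), sigma)
        # Keep the half of the synthetic pool the (noisy) votes favor ...
        keep = np.argsort(noisy_votes)[-(n_synthetic // 2):]
        survivors = [synthetic[i] for i in keep]
        # ... and refill the pool with LLM-generated variations of them.
        synthetic = survivors + generate_variations(survivors)
    # The resulting synthetic corpus is what a small model would be trained
    # on (or a large model finetuned on) instead of the raw user data.
    return synthetic
```

In a sketch like this, the overall privacy budget ε is governed by the noise scale and the number of rounds via standard DP composition; the specific budgets reported in the abstract (ε=1.29, ε=7.58) come from the paper's own accounting, not from these placeholder values.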
