arXiv: 2412.13813

Differentially Private Substring and Document Counting

18 December 2024
Giulia Bernardini
Philip Bille
Inge Li Gørtz
Teresa Anna Steiner
Abstract

Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For databases consisting of many documents, one of the most fundamental problems is that of pattern matching and computing (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). In this paper, we initiate the theoretical study of substring and document counting under differential privacy. We give an $\epsilon$-differentially private data structure solving this problem for all patterns simultaneously with a maximum additive error of $O(\ell \cdot \mathrm{polylog}(n\ell|\Sigma|))$, where $\ell$ is the maximum length of a document in the database, $n$ is the number of documents, and $|\Sigma|$ is the size of the alphabet. We show that this is optimal up to a $O(\mathrm{polylog}(n\ell))$ factor. Further, we show that for $(\epsilon,\delta)$-differential privacy, the bound for document counting can be improved to $O(\sqrt{\ell} \cdot \mathrm{polylog}(n\ell|\Sigma|))$. Additionally, our data structures are efficient. In particular, our data structures use $O(n\ell^2)$ space, $O(n^2\ell^4)$ preprocessing time, and $O(|P|)$ query time where $P$ is the query pattern. Along the way, we develop a new technique for differentially privately computing a general class of counting functions on trees of independent interest. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and $q$-grams. For $q$-grams, we further improve the preprocessing time of the data structure.
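
To make the two query types concrete, the following is a minimal Python sketch of exact (non-private) substring and document counting, followed by a naive single-query Laplace baseline. It is not the paper's data structure: the function names, the toy database, and the baseline are illustrative assumptions. The baseline also assumes that neighboring databases differ in one document of length at most $\ell$, so that a single substring count changes by at most $\ell$ and a single document count by at most 1. It privatizes only one query, whereas the paper's structure answers all patterns simultaneously.

```python
import random


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def substring_count(docs: list[str], pattern: str) -> int:
    """Total number of occurrences of pattern across all documents."""
    return sum(count_occurrences(d, pattern) for d in docs)


def document_count(docs: list[str], pattern: str) -> int:
    """Number of documents containing pattern at least once."""
    if not pattern:
        return 0
    return sum(1 for d in docs if pattern in d)


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_single_query(true_count: int, sensitivity: float,
                       epsilon: float, rng: random.Random) -> float:
    """Naive Laplace mechanism for ONE query (illustrative baseline only)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical toy database; ell is the maximum document length.
docs = ["abab", "bab", "ccc"]
ell = max(len(d) for d in docs)
rng = random.Random(0)

exact_sub = substring_count(docs, "ab")   # occurrences over all documents
exact_doc = document_count(docs, "ab")    # documents containing "ab"

# Assumed sensitivities: ell for substring counting, 1 for document counting.
noisy_sub = noisy_single_query(exact_sub, sensitivity=ell, epsilon=1.0, rng=rng)
noisy_doc = noisy_single_query(exact_doc, sensitivity=1, epsilon=1.0, rng=rng)
```

The gap between the two assumed sensitivities gives some intuition for why the error bounds scale with $\ell$ for substring counting, while document counting leaves room for the improved $\sqrt{\ell}$ bound under $(\epsilon,\delta)$-differential privacy.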
