Differentially Private Substring and Document Counting

Abstract

Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For databases consisting of many documents, one of the most fundamental problems is that of pattern matching and computing (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). In this paper, we initiate the theoretical study of substring and document counting under differential privacy. We give an $\epsilon$-differentially private data structure solving this problem for all patterns simultaneously with a maximum additive error of $O(\ell \cdot \mathrm{polylog}(n\ell|\Sigma|))$, where $\ell$ is the maximum length of a document in the database, $n$ is the number of documents, and $|\Sigma|$ is the size of the alphabet. We show that this is optimal up to a $O(\mathrm{polylog}(n\ell))$ factor. Further, we show that for $(\epsilon,\delta)$-differential privacy, the bound for document counting can be improved to $O(\sqrt{\ell} \cdot \mathrm{polylog}(n\ell|\Sigma|))$. Additionally, our data structures are efficient. In particular, our data structures use $O(n\ell^2)$ space, $O(n^2\ell^4)$ preprocessing time, and $O(|P|)$ query time where $P$ is the query pattern. Along the way, we develop a new technique for differentially privately computing a general class of counting functions on trees of independent interest. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and $q$-grams. For $q$-grams, we further improve the preprocessing time of the data structure.
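
To make the two counting problems concrete, the following Python sketch computes both counts exactly and then answers a single query with the standard Laplace mechanism. It is not the data structure from the paper: it handles one pattern at a time, whereas the paper answers all patterns simultaneously. The function names are hypothetical, and the sketch assumes that neighboring databases differ in one document of length at most $\ell$, so that a single substring count changes by at most $\ell$ and a single document count by at most 1.

```python
import random


def substring_count(docs, pattern):
    """Total number of (possibly overlapping) occurrences of pattern in all documents."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    total = 0
    for doc in docs:
        start = doc.find(pattern)
        while start != -1:
            total += 1
            # Advance by one position so overlapping occurrences are counted.
            start = doc.find(pattern, start + 1)
    return total


def document_count(docs, pattern):
    """Number of documents that contain pattern as a substring."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    return sum(1 for doc in docs if pattern in doc)


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, sensitivity, epsilon, rng):
    """Laplace mechanism for ONE query (assumed baseline, not the paper's method)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


docs = ["abab", "bab", "cc"]
max_len = max(len(d) for d in docs)  # plays the role of ell
rng = random.Random(0)

exact_sub = substring_count(docs, "ab")   # occurrences: 2 in "abab", 1 in "bab"
exact_doc = document_count(docs, "ab")    # "abab" and "bab" contain "ab"

# Assumed sensitivities: ell for substring counting, 1 for document counting.
noisy_sub = private_count(exact_sub, sensitivity=max_len, epsilon=1.0, rng=rng)
noisy_doc = private_count(exact_doc, sensitivity=1, epsilon=1.0, rng=rng)
```

Answering every pattern this way would require splitting the privacy budget across all queries, which is the cost the paper's data structure is designed to avoid.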
