Differentially Private Substring and Document Counting

18 December 2024

Giulia Bernardini

Main:28 Pages

Bibliography:5 Pages

Abstract

Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For databases consisting of many documents, one of the most fundamental problems is that of pattern matching and computing (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). In this paper, we initiate the theoretical study of substring and document counting under differential privacy. We give an $\epsilon$ -differentially private data structure solving this problem for all patterns simultaneously with a maximum additive error of $O(\ell \cdot\mathrm{polylog}(n\ell|\Sigma|))$ , where $\ell$ is the maximum length of a document in the database, $n$ is the number of documents, and $|\Sigma|$ is the size of the alphabet. We show that this is optimal up to a $O(\mathrm{polylog}(n\ell))$ factor. Further, we show that for $(\epsilon,\delta)$ -differential privacy, the bound for document counting can be improved to $O(\sqrt{\ell} \cdot\mathrm{polylog}(n\ell|\Sigma|))$ . Additionally, our data structures are efficient. In particular, our data structures use $O(n\ell^2)$ space, $O(n^2\ell^4)$ preprocessing time, and $O(|P|)$ query time where $P$ is the query pattern. Along the way, we develop a new technique for differentially privately computing a general class of counting functions on trees of independent interest. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and $q$ -grams. For $q$ -grams, we further improve the preprocessing time of the data structure.

View on arXiv

Comments on this paper