Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For such databases, one of the most fundamental problems is pattern matching: computing (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). In this paper, we initiate the theoretical study of substring and document counting under differential privacy. We give an ε-differentially private data structure solving this problem for all patterns simultaneously, with a maximum additive error bounded in terms of the maximum length of a document in the database, the number of documents, and the size of the alphabet. We show that this is optimal up to a factor. Further, we show that for (ε, δ)-differential privacy, the bound for document counting can be improved. Additionally, our data structures are efficient: we bound their space, preprocessing time, and query time, the last in terms of the query pattern. Along the way, we develop a new technique, of independent interest, for differentially privately computing a general class of counting functions on trees. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and q-grams. For q-grams, we further improve the preprocessing time of the data structure.
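To make the two counting problems concrete, the following is a minimal Python sketch of the exact (non-private) counts, together with a textbook Laplace-mechanism baseline for a single pattern. It is not the data structure of this paper, which answers all patterns simultaneously. The function names, the parameter `max_len`, and the neighboring relation (two databases differ by replacing one document of length at most `max_len`) are assumptions made for illustration only.

```python
import random


def substring_count(docs, pattern):
    """Total number of (possibly overlapping) occurrences of pattern in all documents."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    total = 0
    for doc in docs:
        start = doc.find(pattern)
        while start != -1:
            total += 1
            # Advance by one position so overlapping occurrences are counted.
            start = doc.find(pattern, start + 1)
    return total


def document_count(docs, pattern):
    """Number of documents that contain pattern as a substring at least once."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    return sum(1 for doc in docs if pattern in doc)


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_substring_count(docs, pattern, epsilon, max_len, rng):
    """Single-query baseline (assumption, not the paper's method).

    Replacing one document of length <= max_len changes the substring count
    of a fixed pattern by at most max_len, so the noise scale is max_len / epsilon.
    """
    if any(len(doc) > max_len for doc in docs):
        raise ValueError("every document must have length at most max_len")
    return substring_count(docs, pattern) + laplace_noise(max_len / epsilon, rng)


def noisy_document_count(docs, pattern, epsilon, rng):
    """Single-query baseline: replacing one document changes the count by at most 1."""
    return document_count(docs, pattern) + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    database = ["abab", "bab", "cc"]
    rng = random.Random(0)
    print(substring_count(database, "ab"))   # 3: two occurrences in "abab", one in "bab"
    print(document_count(database, "ab"))    # 2: "abab" and "bab"
    print(noisy_substring_count(database, "ab", epsilon=1.0, max_len=4, rng=rng))
    print(noisy_document_count(database, "ab", epsilon=1.0, rng=rng))
```

The baseline shows why substring counting is the harder of the two: a single document can change a substring count by an amount that grows with the document length, whereas it changes a document count by at most one. Answering many patterns with this baseline would also require splitting the privacy budget across queries, which is the cost a data structure for all patterns simultaneously is meant to avoid.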