
Estimating Differential Entropy under Gaussian Convolutions

Abstract

This paper studies the problem of estimating the differential entropy $h(S+Z)$, where $S$ and $Z$ are independent $d$-dimensional random variables with $Z \sim \mathcal{N}(0,\sigma^2 \mathrm{I}_d)$. The distribution of $S$ is unknown, but $n$ independently and identically distributed (i.i.d.) samples from it are available. The question is whether having access to samples of $S$, as opposed to samples of $S+Z$, can improve estimation performance. We show that the answer is positive. More concretely, we first show that, despite the regularizing effect of noise, the number of required samples still needs to scale exponentially in $d$. This result is proven via a random-coding argument that reduces the question to estimating the Shannon entropy on a $2^{O(d)}$-sized alphabet. Next, for fixed $d$ and $n$ large enough, it is shown that a simple plug-in estimator, given by the differential entropy of the empirical distribution of $S$ convolved with the Gaussian density, achieves a loss of $O\left((\log n)^{d/4}/\sqrt{n}\right)$. Note that the plug-in estimator amounts here to the differential entropy of a $d$-dimensional Gaussian mixture, for which we propose an efficient Monte Carlo computation algorithm. At the same time, estimating $h(S+Z)$ via popular differential entropy estimators (based on kernel density estimation (KDE) or k-nearest neighbors (kNN) techniques) applied to samples from $S+Z$ would attain only the much slower rate of order $O(n^{-1/d})$, despite the smoothness of $P_{S+Z}$. As an application, which was in fact our original motivation for the problem, we estimate information flows in deep neural networks and discuss Tishby's Information Bottleneck and the compression conjecture, among others.
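Since the plug-in quantity is the differential entropy of an $n$-component Gaussian mixture, it has no closed form and must be approximated numerically. The following is a minimal Python sketch, not the paper's implementation, of one standard Monte Carlo approach: draw samples from the mixture and average the negative log-density. The function name `plugin_entropy_mc`, the default sample sizes, and the toy Gaussian example in the usage block are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def plugin_entropy_mc(samples, sigma, n_mc=20_000, seed=None):
    """Monte Carlo estimate of the plug-in quantity h(P_hat * N(0, sigma^2 I_d)),
    i.e. the differential entropy of the mixture (1/n) sum_i N(S_i, sigma^2 I_d).

    samples : (n, d) array of i.i.d. draws S_1, ..., S_n
    sigma   : noise standard deviation
    n_mc    : number of Monte Carlo draws from the mixture
    """
    rng = np.random.default_rng(seed)
    n, d = samples.shape

    # Draw X ~ mixture: pick a component center uniformly, add N(0, sigma^2 I_d) noise.
    idx = rng.integers(n, size=n_mc)
    x = samples[idx] + sigma * rng.standard_normal((n_mc, d))

    # Squared distances ||x_j - S_i||^2 via |x|^2 - 2 x.s + |s|^2,
    # avoiding an (n_mc, n, d) intermediate array.
    sq = ((x**2).sum(1)[:, None]
          - 2.0 * x @ samples.T
          + (samples**2).sum(1)[None, :])            # shape (n_mc, n)

    # Log-density of the mixture at each draw, stabilized with logsumexp.
    log_comp = -sq / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    log_mix = logsumexp(log_comp, axis=1) - np.log(n)

    # Differential entropy h(g) = -E[log g(X)], estimated by the sample mean.
    return -log_mix.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, sigma = 2, 500, 1.0
    s = rng.standard_normal((n, d))                   # S ~ N(0, I_d)  =>  S+Z ~ N(0, 2 I_d)
    est = plugin_entropy_mc(s, sigma, seed=1)
    exact = 0.5 * d * np.log(2 * np.pi * np.e * 2.0)  # closed-form entropy of N(0, 2 I_d)
    print(f"plug-in MC estimate: {est:.3f}   true h(S+Z): {exact:.3f}")
```

The `logsumexp` reduction keeps the mixture log-density numerically stable even when $\sigma$ is small relative to the spread of the samples, which is exactly the regime where naive exponentiation underflows.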
