We provide efficient algorithms for the problem of distribution learning from high-dimensional Gaussian data where, in each sample, some of the variable values are missing. We suppose that the variables are missing not at random (MNAR). The missingness model, denoted by $S(y)$, is the function that maps any point $y \in \mathbb{R}^d$ to the subset of its coordinates that are seen. In this work, we assume that it is known. We study the following two settings:

(i) Self-censoring: An observation $x$ is generated by first sampling the true value $y$ from a $d$-dimensional Gaussian $\mathcal{N}(\mu^*, \Sigma^*)$ with unknown $\mu^*$ and $\Sigma^*$. For each coordinate $i$, there exists a set $S_i \subseteq \mathbb{R}$ such that $x_i = y_i$ if and only if $y_i \in S_i$. Otherwise, $x_i$ is missing and takes a generic value (e.g., "?"). We design an algorithm that learns $\mathcal{N}(\mu^*, \Sigma^*)$ up to total variation (TV) distance $\epsilon$, using $\mathrm{poly}(d, 1/\epsilon)$ samples, assuming only that each pair of coordinates is observed with sufficiently high probability.

(ii) Linear thresholding: An observation $x$ is generated by first sampling $y$ from a $d$-dimensional Gaussian $\mathcal{N}(\mu^*, \Sigma)$ with unknown $\mu^*$ and known $\Sigma$, and then applying the missingness model $S$, where $S(y) = \{i \in [d] : v_i^\top y \le b_i\}$ for some $v_1, \dots, v_d \in \mathbb{R}^d$ and $b_1, \dots, b_d \in \mathbb{R}$. We design an efficient mean estimation algorithm, assuming that none of the possible missingness patterns is very rare conditioned on the values of the observed coordinates and that any small subset of coordinates is observed with sufficiently high probability.
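The two generative processes above can be illustrated with a short simulation. This is a sketch, not the paper's algorithm: the censoring intervals, the thresholding vectors `V`, and the offsets `b` below are arbitrary choices made here for illustration, and NaN stands in for the generic missing value "?".

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
mu = np.zeros(d)
Sigma = np.eye(d)

# True (uncensored) samples y ~ N(mu, Sigma).
y = rng.multivariate_normal(mu, Sigma, size=n)

# (i) Self-censoring: coordinate i is observed iff y_i lies in S_i.
# Here we pick the intervals S_i = [-2, 2] purely for illustration.
x_self = np.where(np.abs(y) <= 2.0, y, np.nan)

# (ii) Linear thresholding: coordinate i is observed iff v_i^T y <= b_i.
# Random v_i and b_i = 1 are again illustrative, not from the paper.
V = rng.standard_normal((d, d))
b = np.ones(d)
mask = (y @ V.T) <= b          # mask[k, i] is True iff v_i . y_k <= b_i
x_lin = np.where(mask, y, np.nan)
```

A learner in either setting only sees `x_self` or `x_lin`; the whole difficulty of the MNAR regime is that whether an entry is NaN depends on the latent value `y` itself.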
@article{bhattacharyya2025_2504.19446,
  title={Learning High-dimensional Gaussians from Censored Data},
  author={Arnab Bhattacharyya and Constantinos Daskalakis and Themis Gouleakis and Yuhao Wang},
  journal={arXiv preprint arXiv:2504.19446},
  year={2025}
}