Randomized incomplete U-statistics in high dimensions

This paper studies inference for the mean vector of a high-dimensional U-statistic. In the era of Big Data, the dimension of the U-statistic and the sample size of the observations both tend to be large, and the computation of the U-statistic is prohibitively demanding. Data-dependent inferential procedures such as the empirical bootstrap for U-statistics are even more computationally expensive. To overcome this computational bottleneck, incomplete U-statistics obtained by sampling fewer terms of the U-statistic are attractive alternatives. In this paper, we introduce randomized incomplete U-statistics with sparse weights whose computational cost can be made independent of the order of the U-statistic. We derive non-asymptotic Gaussian approximation error bounds for the randomized incomplete U-statistics in high dimensions, namely in cases where the dimension is possibly much larger than the sample size, for both non-degenerate and degenerate kernels. In addition, we propose generic bootstrap methods for the incomplete U-statistics that are computationally much less demanding than existing bootstrap methods, and establish finite sample validity of the proposed bootstrap methods. Our methods are illustrated by an application to nonparametric testing for the pairwise independence of a high-dimensional random vector under weaker assumptions than those appearing in the literature.
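To make the idea of "sampling fewer terms" concrete, the following is a minimal Python sketch, not the paper's actual construction. It assumes a scalar, order-2 kernel chosen purely for illustration (the paper treats high-dimensional kernels), and it draws index subsets uniformly at random with replacement, which is only one possible sampling scheme. All function names and parameters (`kernel`, `complete_u_statistic`, `incomplete_u_statistic`, `num_terms`) are hypothetical.

```python
import itertools
import random


def kernel(x, y):
    # Illustrative symmetric kernel of order 2 (an assumption, not from the paper):
    # h(x, y) = (x - y)^2 / 2, whose complete U-statistic is the unbiased sample variance.
    return 0.5 * (x - y) ** 2


def complete_u_statistic(data, h, order=2):
    # Average the kernel over all C(n, order) subsets of the observations.
    values = [h(*subset) for subset in itertools.combinations(data, order)]
    return sum(values) / len(values)


def incomplete_u_statistic(data, h, num_terms, order=2, rng=None):
    # Average the kernel over num_terms randomly drawn subsets, so the number
    # of kernel evaluations is num_terms rather than C(n, order).
    rng = rng or random.Random()
    total = 0.0
    for _ in range(num_terms):
        subset = rng.sample(data, order)  # `order` distinct observations
        total += h(*subset)
    return total / num_terms


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(200)]
    print("complete:  ", complete_u_statistic(data, kernel))
    print("incomplete:", incomplete_u_statistic(data, kernel, num_terms=2000, rng=rng))
```

In this sketch the complete statistic evaluates the kernel on every pair of observations, while the incomplete version evaluates it only `num_terms` times, trading some extra randomness for a much smaller computational budget.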