
Semi-supervised Inference for Explained Variance in High-dimensional Linear Regression and Its Applications

T. Tony Cai
Zijian Guo
Abstract

This paper considers statistical inference for the explained variance $\beta^{\intercal}\Sigma \beta$ under the high-dimensional linear model $Y=X\beta+\epsilon$ in the semi-supervised setting, where $\beta$ is the regression vector and $\Sigma$ is the design covariance matrix. A calibrated estimator, which efficiently integrates both labelled and unlabelled data, is proposed. It is shown that the estimator achieves the minimax optimal rate of convergence in the general semi-supervised framework. The optimality result characterizes how the unlabelled data contributes to the estimation accuracy. Moreover, the limiting distribution for the proposed estimator is established and the unlabelled data has also proven useful in reducing the length of the confidence interval for the explained variance. The proposed method is extended to the semi-supervised inference for the unweighted quadratic functional, $\|\beta\|_2^2$. The obtained inference results are then applied to a range of high-dimensional statistical problems, including signal detection and global testing, prediction accuracy evaluation, and confidence ball construction. The numerical improvement of incorporating the unlabelled data is demonstrated through simulation studies and an analysis of estimating heritability for a yeast segregant data set with multiple traits.
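To make the target quantity concrete, the following is a minimal low-dimensional sketch of $\beta^{\intercal}\Sigma\beta$ and of a naive plug-in estimate that pools labelled and unlabelled covariates when estimating $\Sigma$. It is not the calibrated estimator proposed in the paper, whose construction is not given in the abstract. The function names, the least-squares fit, the AR(1)-style covariance, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np


def explained_variance(beta, sigma):
    """Return the quadratic form beta^T Sigma beta."""
    beta = np.asarray(beta, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(beta @ sigma @ beta)


def plug_in_estimate(beta_hat, x_labelled, x_unlabelled=None):
    """Naive plug-in (illustrative only, NOT the paper's calibrated estimator).

    Sigma is estimated by the sample covariance of all available covariates,
    so unlabelled rows only enter through the covariance estimate.
    """
    if x_unlabelled is None:
        x_all = x_labelled
    else:
        x_all = np.vstack([x_labelled, x_unlabelled])
    sigma_hat = np.cov(x_all, rowvar=False)
    return explained_variance(beta_hat, sigma_hat)


# Hypothetical simulation settings (assumed, not from the paper).
rng = np.random.default_rng(0)
p, n_labelled, n_unlabelled = 5, 200, 5000
beta = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
idx = np.arange(p)
sigma = 0.5 ** np.abs(np.subtract.outer(idx, idx))  # Sigma_ij = 0.5^|i-j|

x_lab = rng.multivariate_normal(np.zeros(p), sigma, size=n_labelled)
x_unlab = rng.multivariate_normal(np.zeros(p), sigma, size=n_unlabelled)
y = x_lab @ beta + rng.normal(size=n_labelled)  # Y = X beta + eps

# Ordinary least squares is adequate here only because p is much smaller than n.
beta_hat, *_ = np.linalg.lstsq(x_lab, y, rcond=None)

truth = explained_variance(beta, sigma)
est_supervised = plug_in_estimate(beta_hat, x_lab)
est_semi = plug_in_estimate(beta_hat, x_lab, x_unlab)
print(truth, est_supervised, est_semi)
```

In a genuinely high-dimensional setting, least squares is unavailable and a plug-in of a regularized estimate of $\beta$ is generally biased, which is the kind of difficulty a calibrated estimator is meant to address.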
