
On the number of variables to use in principal component regression

Abstract

We study least squares linear regression over $N$ uncorrelated Gaussian features that are selected in order of decreasing variance. When the number of selected features $p$ is at most the sample size $n$, the estimator under consideration coincides with the principal component regression estimator; when $p > n$, the estimator is the least $\ell_2$ norm solution over the selected features. We give an average-case analysis of the out-of-sample prediction error as $p, n, N \to \infty$ with $p/N \to \alpha$ and $n/N \to \beta$, for some constants $\alpha \in [0, 1]$ and $\beta \in (0, 1)$. In this average-case setting, the prediction error exhibits a "double descent" shape as a function of $p$. We also establish conditions under which the minimum risk is achieved in the interpolating ($p > n$) regime.
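The estimator described above can be sketched numerically. This is a minimal illustration, not the paper's analysis: the data-generating parameters (sample size, variance profile, signal, and noise level) are hypothetical choices. Since the features are uncorrelated with decreasing variances, regressing on the first $p$ features matches principal component regression when $p \le n$; `np.linalg.pinv` returns the ordinary least squares solution in that regime and the minimum $\ell_2$ norm least squares solution when $p > n$.

```python
import numpy as np

def pcr_estimator(X, y, p):
    """Least squares over the first p columns of X (features ordered by
    decreasing variance). For p <= n this is OLS on the selected features;
    for p > n, pinv yields the minimum ell_2-norm least squares solution."""
    beta = np.zeros(X.shape[1])
    beta[:p] = np.linalg.pinv(X[:, :p]) @ y
    return beta

# Hypothetical setup: n samples, N uncorrelated Gaussian features with
# decreasing variances, a linear signal plus noise.
rng = np.random.default_rng(0)
n, N = 50, 100
variances = np.linspace(2.0, 0.1, N)              # decreasing feature variances
X = rng.normal(size=(n, N)) * np.sqrt(variances)
beta_true = rng.normal(size=N) / np.sqrt(N)
y = X @ beta_true + 0.5 * rng.normal(size=n)

# Out-of-sample prediction error across both regimes (p <= n and p > n).
X_test = rng.normal(size=(1000, N)) * np.sqrt(variances)
y_test = X_test @ beta_true
for p in (10, 50, 90):
    beta_hat = pcr_estimator(X, y, p)
    risk = np.mean((X_test @ beta_hat - y_test) ** 2)
    print(f"p = {p:3d}: test MSE = {risk:.4f}")
```

Sweeping $p$ from 1 to $N$ in such a simulation traces out the double-descent curve: the test error rises as $p$ approaches $n$ and can fall again in the interpolating regime.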
