Factor selection by permutation

2 October 2017

Abstract

Researchers often have data measuring features $x_{ij}$ of samples, such as test scores of students. In factor analysis and PCA, these features are thought to be influenced by unobserved factors, such as skills. Can we determine how many factors affect the data? Many approaches have been developed for this factor selection problem. The popular Parallel Analysis method randomly permutes each feature of the data. It selects factors if their singular values are larger than those of the permuted data. It is used by leading applied statisticians, including T Hastie, M Stephens, J Storey, R Tibshirani and WH Wong. Despite empirical evidence for its accuracy, there is currently no theoretical justification. This prevents us from knowing when it will work in the future. In this paper, we show that parallel analysis consistently selects the significant factors in certain high-dimensional factor models. The intuition is that permutations keep the noise invariant, while "destroying" the low-rank signal. This provides justification for permutation methods in PCA and factor models under some conditions. A key requirement is that the factors must load on several variables. Our work points to improvements of permutation methods.

View on arXiv

Comments on this paper