Estimating false inclusion rates in penalized regression models

19 July 2016

Abstract

Penalized regression methods are an attractive tool for feature selection with many appealing properties, although their widespread adoption has been hampered by the difficulty of applying inferential tools. In particular, the question "How reliable is the selection of those features?" has proved difficult to address, partially due to the complexity of defining a false discovery in the penalized regression setting. Here, I define a false inclusion as a variable that is independent of the outcome regardless of whether other variables are conditioned on. This definition permits straightforward estimation of the number of false inclusions. Theoretical analysis and simulation studies demonstrate that this approach is quite accurate when the correlation among predictors is mild, and slightly conservative when the correlation is moderate. Finally, the practical utility of the proposed method is illustrated using gene expression data from The Cancer Genome Atlas and GWAS data from the Myocardial Applied Genomics Network.
