Solar: a least-angle regression for stable variable selection in high-dimensional spaces
We propose a new algorithm for variable selection in high-dimensional data, called subsample-ordered least-angle regression (solar). Solar relies on the average solution path computed across subsamples and alleviates several known high-dimensional issues with lasso and least-angle regression. We illustrate in simulations that, with the same computational load, solar yields substantial improvements over lasso in terms of the sparsity (a 37-64% reduction in the average number of selected variables), stability, and accuracy of variable selection. Moreover, solar supplemented with the hold-out average (an adaptation of classical post-OLS tests) purges almost all of the redundant variables while retaining all of the informative variables. Using simulations and real-world data, we also illustrate numerically that sparse solar variable selection is robust to complicated dependence structures and harsh settings of the irrepresentable condition. In addition, replacing lasso with solar in an ensemble system (e.g., the bootstrap ensemble) significantly reduces the computational load of the ensemble (at least 96% fewer subsample repetitions) and improves selection sparsity. We provide a Python parallel computing package for solar (solarpy) in the supplementary file and at https://github.com/isaac2math/solar.
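The idea of averaging a solution path across subsamples can be illustrated with a minimal sketch. It is not the solarpy implementation and makes several assumptions: forward-stagewise regression is swapped in as a simple stand-in for least-angle regression, the scoring rule `(p - pos) / p` is an illustrative choice, and the function names `entry_order` and `average_entry_scores` are hypothetical.

```python
import numpy as np

def entry_order(X, y, step=0.01, max_iter=5000):
    """Order in which variables become active under forward-stagewise
    regression (an assumed stand-in for least-angle regression)."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
    resid = y - y.mean()
    order = []
    for _ in range(max_iter):
        corr = Xs.T @ resid / n                 # correlation with residual
        j = int(np.argmax(np.abs(corr)))
        if abs(corr[j]) < step:                 # nothing left to explain
            break
        if j not in order:
            order.append(j)                     # variable j enters the path
        resid = resid - step * np.sign(corr[j]) * Xs[:, j]
    return order

def average_entry_scores(X, y, n_subsamples=10, frac=0.5, seed=0):
    """Average an entry-order score over random subsamples.
    Earlier entry gives a higher score; never entering gives 0
    (illustrative scoring rule, assumed for this sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(frac * n)
    scores = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=m, replace=False)   # subsample rows
        for pos, j in enumerate(entry_order(X[idx], y[idx])):
            scores[j] += (p - pos) / p
    return scores / n_subsamples

# Synthetic data: only variables 0 and 1 are informative.
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(200)

scores = average_entry_scores(X, y)
ranking = np.argsort(-scores)                   # highest average score first
print(ranking, np.round(scores, 2))
```

Variables that enter early in most subsamples accumulate high average scores, so ordering by `scores` gives a ranking that is less sensitive to any single subsample than one solution path fitted on the full data.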