
Statistically Optimal First Order Algorithms: A Proof via Orthogonalization

Abstract

We consider a class of statistical estimation problems in which we are given a random data matrix ${\boldsymbol X}\in {\mathbb R}^{n\times d}$ (and possibly some labels ${\boldsymbol y}\in{\mathbb R}^n$) and would like to estimate a coefficient vector ${\boldsymbol \theta}\in{\mathbb R}^d$ (or possibly a constant number of such vectors). Special cases include low-rank matrix estimation and regularized estimation in generalized linear models (e.g., sparse regression). First order methods proceed by iteratively multiplying the current estimates by ${\boldsymbol X}$ or its transpose; examples include gradient descent and its accelerated variants. Celentano, Montanari, and Wu proved that, for any constant number of iterations (matrix-vector multiplications), the optimal first order algorithm is a specific approximate message passing algorithm (known as `Bayes AMP'). The error of this estimator can be characterized in the high-dimensional asymptotics $n,d\to\infty$, $n/d\to\delta$, and provides a lower bound on the estimation error of any first order algorithm. Here we present a simpler proof of the same result, and generalize it to broader classes of data distributions and of first order algorithms, including algorithms with non-separable nonlinearities. Most importantly, the new proof technique does not require constructing an equivalent tree-structured estimation problem, and is therefore amenable to a broader range of applications.
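To make the notion of a first order algorithm concrete, the sketch below implements gradient descent for least squares, one of the examples mentioned in the abstract: each iteration accesses the data only through one multiplication by ${\boldsymbol X}$ and one by its transpose. This is a minimal illustrative special case, not the Bayes AMP algorithm analyzed in the paper; the step size rule and iteration count are assumptions made for the example.

```python
import numpy as np

def gradient_descent_ls(X, y, n_iters=20, step=None):
    """Gradient descent for least squares: a simple first order method.

    Each iteration touches the data only through one multiplication by X
    and one by X.T, matching the matrix-vector oracle described above.
    """
    n, d = X.shape
    if step is None:
        # Illustrative choice: step = 1 / ||X||_op^2 guarantees convergence
        # for the least-squares objective (its gradient is ||X||_op^2-Lipschitz).
        step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)
    theta = np.zeros(d)
    for _ in range(n_iters):
        residual = X @ theta - y           # one multiplication by X
        theta -= step * (X.T @ residual)   # one multiplication by X.T
    return theta
```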
