
On the Local Minima of the Empirical Risk

Abstract

Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex nonsmooth losses (such as modern deep networks), the population risk is generally significantly more well-behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework which aims to optimize a smooth nonconvex function $F$ (population risk) given only access to an approximation $f$ (empirical risk) that is pointwise close to $F$ (i.e., $\|F - f\|_{\infty} \le \nu$). Our objective is to find the $\epsilon$-approximate local minima of the underlying function $F$ while avoiding the shallow local minima (arising because of the tolerance $\nu$) which exist only in $f$. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ that is guaranteed to achieve our goal as long as $\nu \le O(\epsilon^{1.5}/d)$. We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance $\nu$ among all algorithms making a polynomial number of queries of $f$. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.
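A minimal sketch of the core idea, under illustrative assumptions: run SGD on a Gaussian-smoothed surrogate of $f$, whose gradient can be estimated from function evaluations of $f$ alone. The function name `smoothed_sgd`, the hyperparameters, and the toy objective below are hypothetical choices for exposition, not the paper's exact algorithm or parameter settings.

```python
import numpy as np

def smoothed_sgd(f, x0, sigma=0.1, eta=0.01, batch=16, n_iters=1000, seed=0):
    """Sketch: SGD on a Gaussian-smoothed surrogate of f.

    The smoothed function is f_sigma(x) = E_{z ~ N(0, sigma^2 I)}[f(x + z)];
    its gradient admits the zeroth-order estimator
        grad f_sigma(x) ~= E_z[z * (f(x + z) - f(x))] / sigma^2,
    which requires only evaluations of f (the "empirical risk" oracle).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    for _ in range(n_iters):
        # Mini-batch zeroth-order estimate of the smoothed gradient.
        z = rng.normal(scale=sigma, size=(batch, d))
        fx = f(x)
        deltas = np.array([f(x + zi) - fx for zi in z])
        grad_est = (deltas[:, None] * z).mean(axis=0) / sigma**2
        x = x - eta * grad_est
    return x

if __name__ == "__main__":
    # Toy usage: f is a pointwise nu-perturbation of the smooth population
    # risk F(x) = ||x||^2 / 2, mimicking shallow minima created by sampling.
    nu = 1e-3
    f = lambda x: 0.5 * np.dot(x, x) + nu * np.sin(100.0 * np.linalg.norm(x))
    x_out = smoothed_sgd(f, x0=np.ones(5))
    print("distance to the population minimizer:", np.linalg.norm(x_out))
```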
