A Diffusion Theory For Minima Selection: Stochastic Gradient Descent
Exponentially Favors Flat Minima
- ODL
Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training deep networks in practice. SGD is known to find flat minima, which lie in large neighboring regions of parameter space where every weight vector attains similarly small error. However, the quantitative theory behind stochastic gradients remains underexplored. In this paper, we focus on a fundamental question in deep learning: how does deep learning select flat minima among so many candidate minima? To answer this question quantitatively, we develop a density diffusion theory (DDT) that reveals how minima selection depends on minima sharpness, gradient noise, and hyperparameters. We discover that stochastic gradient noise accelerates the escape from sharp minima exponentially in the eigenvalues of the minima's Hessians. Thus, SGD favors flat minima over sharp minima exponentially, whereas Stochastic Gradient Langevin Dynamics (SGLD) favors flat minima only polynomially. We also prove that both small-learning-rate and large-batch training require a number of iterations to escape minima that is exponential in the ratio of batch size to learning rate, and thus cannot find flat minima within realistic computational time.
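The exponential dependence of escape time on minima sharpness can be illustrated with a toy 1D Monte Carlo sketch. This is not the paper's derivation: the quadratic well, the fixed barrier height, and the Hessian-proportional gradient-noise variance are modeling assumptions chosen to mimic the abstract's claims, and the function and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_escape_iters(a, lr=0.1, batch=1, barrier=0.2, trials=200, cap=100_000):
    """Mean number of SGD steps to escape a quadratic well of curvature `a`.

    Toy assumptions: the loss near the minimum is f(x) = a * x^2 / 2, the
    barrier sits where f(x) = `barrier`, and per-step gradient noise has
    variance a / batch, mimicking SGD noise whose covariance scales with
    the local Hessian.
    """
    r = np.sqrt(2 * barrier / a)          # barrier location: f(r) = barrier
    times = []
    for _ in range(trials):
        x, t = 0.0, 0
        while abs(x) < r and t < cap:
            noise = rng.normal(0.0, np.sqrt(a / batch))
            x -= lr * (a * x + noise)     # SGD step with a noisy gradient
            t += 1
        times.append(t)
    return float(np.mean(times))

# Larger Hessian eigenvalue (sharper minimum) -> exponentially faster escape.
sharp = mean_escape_iters(a=4.0)   # sharp minimum
flat = mean_escape_iters(a=1.0)    # flat minimum
print(f"sharp: {sharp:.0f} iters, flat: {flat:.0f} iters")
```

Under these assumptions the stationary iterate variance is roughly lr / (2 * batch), so the per-step escape probability decays like exp(-2 * batch * barrier / (lr * a)), which is why the flatter well (smaller `a`) takes exponentially longer to leave.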