A Diffusion Theory for Deep Learning Dynamics: Stochastic Gradient
Descent Escapes From Sharp Minima Exponentially Fast
Stochastic optimization algorithms, such as Stochastic Gradient Descent (SGD) and its variants, are the mainstream methods for training deep networks in practice. However, the theoretical mechanism behind stochastic gradient noise remains poorly understood. Deep learning is known to find flat minima, whose large neighborhoods in parameter space contain weight vectors with similarly small error. In this paper, we focus on a fundamental question in deep learning: ``How does deep learning usually find flat minima among so many minima?'' To answer this question, we develop a density diffusion theory (DDT) for the minima-transition mechanism of SGD. Specifically, we study how minima transitions depend on minima sharpness, gradient noise, and hyperparameters. One of the most interesting findings is that stochastic gradient noise helps SGD escape from sharp minima exponentially faster than from flat minima, whereas white noise helps escape from sharp minima only polynomially faster than from flat minima. We also find that large-batch training requires exponentially many iterations to pass through sharp minima and reach flat minima. We present direct empirical evidence supporting the proposed theoretical results.
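As background for the exponential-versus-polynomial contrast stated above, recall the classical Kramers escape time for one-dimensional overdamped Langevin dynamics driven by isotropic white noise, $d\theta_t = -L'(\theta_t)\,dt + \sqrt{2D}\,dW_t$ (a standard result quoted here for illustration, not a formula taken from this paper). In the small-noise limit, the mean time to escape a minimum at $a$ over a barrier at $b$ is
\[
\tau_{\mathrm{white}} \;\approx\; \frac{2\pi}{\sqrt{L''(a)\,\lvert L''(b)\rvert}} \exp\!\left(\frac{\Delta L}{D}\right), \qquad \Delta L = L(b) - L(a),
\]
so the sharpness $L''(a)$ enters only through the polynomial prefactor, while only the barrier height $\Delta L$ sits in the exponent. The abstract's claim that SGD escapes sharp minima exponentially faster than flat minima corresponds, under the proposed density diffusion theory, to minima sharpness entering the exponent of the escape time once the anisotropic, state-dependent structure of stochastic gradient noise is taken into account.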