Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix

Abstract
We introduce a novel method to compute a low-rank approximation of the inverse of the Hessian matrix in the distributed regime. By leveraging the differences in gradients and parameters across multiple workers, we efficiently implement a distributed approximation of the Newton-Raphson method. We also present preliminary results that highlight the advantages and challenges of second-order methods for large stochastic optimization problems. In particular, our work suggests that novel strategies for combining gradients can provide further information about the loss surface.
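To make the idea concrete, the sketch below shows one plausible way such a low-rank inverse-Hessian approximation could be built from parameter and gradient differences gathered across workers, using a standard L-BFGS-style two-loop recursion. This is an illustrative sketch under stated assumptions, not the authors' exact algorithm; the names `worker_params`, `worker_grads`, `center_params`, and `center_grad`, and the choice of the two-loop recursion itself, are assumptions introduced here for clarity.

```python
# Hedged sketch: approximate Newton direction from worker parameter/gradient
# differences, via an L-BFGS-style two-loop recursion. Not the paper's
# verbatim method; names and structure are illustrative assumptions.
import numpy as np

def two_loop_direction(grad, s_list, y_list):
    """Apply an implicit low-rank inverse-Hessian approximation to `grad`
    using the standard L-BFGS two-loop recursion over (s, y) pairs."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: most recent pair first.
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
    # Initial inverse-Hessian scaling from the most recent pair.
    s, y = s_list[-1], y_list[-1]
    q *= np.dot(s, y) / np.dot(y, y)
    # Second loop: oldest pair first, reusing the stored alphas.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, q)
        q += (alpha - beta) * s
    return q  # approximately H^{-1} @ grad

def distributed_newton_step(center_params, center_grad, worker_params, worker_grads):
    """Form (s, y) pairs from the differences between each worker's parameters
    and gradients and a central point, then return an approximate Newton step."""
    s_list = [wp - center_params for wp in worker_params]
    y_list = [wg - center_grad for wg in worker_grads]
    # Keep only pairs with positive curvature s^T y > 0 so the implicit
    # approximation stays positive definite.
    pairs = [(s, y) for s, y in zip(s_list, y_list) if np.dot(s, y) > 1e-10]
    if not pairs:
        return -center_grad  # fall back to a plain gradient step
    s_list, y_list = map(list, zip(*pairs))
    return -two_loop_direction(center_grad, s_list, y_list)
```

In this reading, each worker contributes one curvature pair per synchronization, so the rank of the implicit approximation scales with the number of workers rather than requiring any extra Hessian-vector computation.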