Rethinking Gauss-Newton for learning over-parameterized models

Abstract

This work studies the global convergence and generalization properties of the Gauss-Newton (GN) method when optimizing one-hidden-layer networks in the over-parameterized regime. We first establish a global convergence result for GN in the continuous-time limit, exhibiting a faster convergence rate than gradient descent (GD) due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of the GN method. We find that, while GN is consistently faster than GD in finding a global optimum, the performance of the learned model on a test dataset is heavily influenced by both the learning rate and the variance of the randomly initialized network's weights. Specifically, we find that initializing with a smaller variance results in better generalization, a behavior also observed for GD. However, in contrast to GD, where larger learning rates lead to the best generalization, we find that GN achieves improved generalization when using smaller learning rates, albeit at the cost of slower convergence. This study emphasizes the significance of the learning rate in balancing the optimization speed of GN with the generalization ability of the learned solution.
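To make the setting concrete, the sketch below shows one way to apply a damped Gauss-Newton update to a one-hidden-layer ReLU network on a synthetic regression problem. It is a minimal illustration, not the paper's method or experimental setup: the width, learning rate, damping term, and initialization scale are assumed for the example, and the paper's analysis concerns the continuous-time (undamped) limit rather than this discrete, regularized step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task: n samples, input dimension d, hidden width m
# (chosen so that the number of parameters m*d + m exceeds n).
n, d, m = 64, 5, 128
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))

# One-hidden-layer ReLU network f(x) = a^T relu(W x).
init_std = 0.1                                  # std of the random initialization (illustrative)
W = rng.normal(scale=init_std, size=(m, d))
a = rng.normal(scale=init_std, size=m)

def forward(W, a, X):
    return np.maximum(X @ W.T, 0.0) @ a         # predictions, shape (n,)

def jacobian(W, a, X):
    """Jacobian of the predictions w.r.t. all parameters, shape (n, m*d + m)."""
    m, d = W.shape
    pre = X @ W.T                               # (n, m) pre-activations
    d_relu = (pre > 0).astype(X.dtype)          # (n, m) ReLU derivative
    JW = (d_relu * a)[:, :, None] * X[:, None, :]   # d f / d W, shape (n, m, d)
    Ja = np.maximum(pre, 0.0)                   # d f / d a, shape (n, m)
    return np.concatenate([JW.reshape(X.shape[0], -1), Ja], axis=1)

def gauss_newton_step(W, a, X, y, lr=1.0, damping=1e-3):
    m, d = W.shape
    r = forward(W, a, X) - y                    # residuals, shape (n,)
    J = jacobian(W, a, X)
    # Damped GN direction. In the over-parameterized regime (m*d + m > n) it is
    # cheaper to solve the n x n system, using the identity
    # (J^T J + lam I)^{-1} J^T r = J^T (J J^T + lam I)^{-1} r.
    delta = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(len(y)), r)
    dW, da = delta[: m * d].reshape(m, d), delta[m * d:]
    return W - lr * dW, a - lr * da

for _ in range(50):
    W, a = gauss_newton_step(W, a, X, y)
print("train MSE:", np.mean((forward(W, a, X) - y) ** 2))
```

In this formulation the learning rate `lr` and the initialization scale `init_std` are exactly the two knobs the abstract identifies as governing the speed/generalization trade-off of GN.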
