Learning inducing points and uncertainty on molecular data

Uncertainty control and scalability to large datasets are the two main obstacles to deploying Gaussian Process models in autonomous pipelines for exploring material and chemical space. One way to address both issues is to introduce latent inducing variables and to choose an appropriate approximation to the marginal log-likelihood objective. Here, we show that variational learning of inducing points in a high-dimensional molecular descriptor space significantly improves both prediction quality and uncertainty estimates on test configurations from a sample molecular dynamics dataset. We further show that the inducing points can learn to represent configurations of molecule types that were absent from the set used to initialize them. Among the several approximate marginal log-likelihood objectives we evaluate, the predictive log-likelihood provides both predictive quality comparable to the exact Gaussian Process model and excellent uncertainty control. Finally, we examine whether Gaussian Processes make predictions by interpolating between molecular configurations in the high-dimensional descriptor space. We show that, contrary to intuition, most predictions are made in the extrapolation regime even for densely sampled molecular datasets.
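To make the setup concrete, a minimal sketch of a sparse variational GP with learnable inducing point locations, written against GPyTorch; this is not the paper's own code, and the kernel choice, number of inducing points, and descriptor dimensionality below are placeholder assumptions:

```python
import torch
import gpytorch

class SparseGP(gpytorch.models.ApproximateGP):
    """Sparse variational GP whose inducing point locations are trained jointly
    with the kernel hyperparameters."""

    def __init__(self, inducing_points):
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        # learn_inducing_locations=True lets the inducing points move through
        # the descriptor space during optimization, which is the "variational
        # learning of the inducing points" referred to in the abstract.
        var_strat = gpytorch.variational.VariationalStrategy(
            self, inducing_points, var_dist, learn_inducing_locations=True
        )
        super().__init__(var_strat)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# Placeholder initialization: 100 inducing points in a 50-dimensional
# descriptor space (e.g., drawn from the training descriptors in practice).
Z0 = torch.randn(100, 50)
model = SparseGP(Z0)
likelihood = gpytorch.likelihoods.GaussianLikelihood()
```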
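Swapping between the approximate objectives the abstract compares is, in GPyTorch for instance, a one-line change of the marginal log-likelihood class. The sketch below assumes the model and likelihood defined above; `num_train` and `loader` are hypothetical stand-ins for the training set size and a minibatch data loader:

```python
# Standard variational ELBO bound on the marginal log-likelihood:
elbo = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=num_train)

# Predictive log-likelihood objective, which the abstract reports as giving
# near-exact predictive quality with the best uncertainty control:
pll = gpytorch.mlls.PredictiveLogLikelihood(likelihood, model, num_data=num_train)

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.01
)
model.train(); likelihood.train()
for x_batch, y_batch in loader:
    optimizer.zero_grad()
    loss = -pll(model(x_batch), y_batch)  # or -elbo(...) for the ELBO variant
    loss.backward()
    optimizer.step()
```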
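The interpolation-versus-extrapolation question is commonly operationalized as a convex-hull membership test: a prediction counts as interpolation only if the test descriptor lies inside the convex hull of the training descriptors, which reduces to a small linear-programming feasibility check. This is a generic sketch of that test under that definition, not necessarily the paper's exact protocol; the data shapes are placeholders:

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(x, X):
    """Return True if point x lies in the convex hull of the rows of X.

    Feasibility LP: find lambda >= 0 with sum(lambda) = 1 and X.T @ lambda = x.
    """
    n = X.shape[0]
    A_eq = np.vstack([X.T, np.ones((1, n))])
    b_eq = np.append(x, 1.0)
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * n, method="highs")
    return res.status == 0  # status 0: a feasible lambda was found

# In a high-dimensional descriptor space this almost always returns False,
# i.e. the prediction is extrapolation in the convex-hull sense.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 50))  # placeholder training descriptors
x_test = rng.normal(size=50)          # placeholder test descriptor
print(in_convex_hull(x_test, X_train))
```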