Unsupervised Ground Metric Learning using Wasserstein Singular Vectors

Defining meaningful distances between samples (the columns of a data matrix) is a fundamental problem in machine learning. Optimal Transport (OT) defines geometrically meaningful "Wasserstein" distances between probability distributions, but it is parametrized by a distance between the features (the rows of the data matrix): the "ground cost". Designing this ground cost, which should be adapted to the task under study, is a key bottleneck: there is usually no straightforward choice of distance on the features, and supervised metric learning is typically not possible either, leaving only ad-hoc approaches. Unsupervised metric learning is thus a fundamental problem for enabling data-driven applications of OT. In this paper, we propose for the first time a canonical answer, by simultaneously computing an OT distance between the rows and between the columns of a data matrix. These distance matrices emerge naturally as positive singular vectors of the function mapping ground costs to pairwise OT distances. We provide criteria ensuring the existence and uniqueness of these singular vectors. We then introduce scalable computational methods to approximate them in high-dimensional settings, using entropic regularization and stochastic approximation. First, we extend the definition using entropic regularization, and show that in the large-regularization limit it performs a principal-component-analysis-style dimensionality reduction. Next, we propose a stochastic approximation scheme and study its convergence. Finally, we showcase Wasserstein Singular Vectors in the context of computational biology, on a high-dimensional single-cell RNA-sequencing dataset.
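The fixed-point idea behind these singular vectors can be sketched as an alternating power iteration: use the current distance between features as the ground cost to compute pairwise OT distances between samples, then use that sample distance as the ground cost for OT between features, renormalizing at each step. Below is an illustrative NumPy-only sketch with a hand-rolled Sinkhorn solver, synthetic data, and a simple max-normalization in place of the paper's exact normalization and convergence criteria; it is not the authors' implementation.

```python
import numpy as np

def sinkhorn_cost(a, b, C, reg=0.1, n_iter=200):
    """Entropic-regularized OT cost between histograms a and b for ground cost C."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return np.sum(P * C)

def pairwise_ot(X, C, reg=0.1):
    """Pairwise OT distances between the columns of X (each column a histogram)."""
    n = X.shape[1]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = sinkhorn_cost(X[:, i], X[:, j], C, reg)
    return D

rng = np.random.default_rng(0)
X = rng.random((5, 6))                     # 5 features x 6 samples (synthetic)
X = X / X.sum(axis=0, keepdims=True)       # columns = distributions over features
Y = X.T / X.T.sum(axis=0, keepdims=True)   # columns = features as distributions over samples

A = np.ones((6, 6)) - np.eye(6)  # initial distance between samples
B = np.ones((5, 5)) - np.eye(5)  # initial distance between features
for _ in range(5):
    # OT between features, with the sample distance A as ground cost
    B = pairwise_ot(Y, A)
    B = B / B.max()
    # OT between samples, with the feature distance B as ground cost
    A = pairwise_ot(X, B)
    A = A / A.max()
```

After the loop, `A` and `B` approximate a pair of mutually consistent distance matrices on samples and features respectively; the paper's singular-vector formulation makes this fixed point precise.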