Contrasting Multiple Representations with the Multi-Marginal Matching Gap

Abstract

Learning meaningful representations of complex objects that can be seen through multiple ($k\geq 3$) views or modalities is a core task in machine learning. Existing methods use losses originally intended for paired views, and extend them to $k$ views, either by instantiating $\tfrac12 k(k-1)$ loss-pairs, or by using reduced embeddings, following a \textit{one vs. average-of-rest} strategy. We propose the multi-marginal matching gap (M3G), a loss that borrows tools from multi-marginal optimal transport (MM-OT) theory to simultaneously incorporate all $k$ views. Given a batch of $n$ points, each seen as a $k$-tuple of views subsequently transformed into $k$ embeddings, our loss contrasts the cost of matching these $n$ ground-truth $k$-tuples with the MM-OT polymatching cost, which seeks $n$ optimally arranged $k$-tuples chosen within these $n\times k$ vectors. While the exponential complexity $O(n^k)$ of the MM-OT problem may seem daunting, we show in experiments that a suitable generalization of the Sinkhorn algorithm for that problem can scale to, e.g., $k=3\sim 6$ views using mini-batches of size $64\sim 128$. Our experiments demonstrate improved performance over multiview extensions of pairwise losses, for both self-supervised and multimodal tasks.
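To make the construction concrete, below is a minimal NumPy sketch of the matching-gap idea, assuming a cost that sums negative inner products over all view pairs, uniform marginals $1/n$, and entropic regularization eps. The function names (`pairwise_sum_cost`, `mmot_cost`, `m3g_loss`) and all parameter choices are illustrative, not the authors' implementation; the sketch materializes the full $O(n^k)$ cost tensor, so it is only meant for small $n$ and $k$.

```python
import numpy as np
from scipy.special import logsumexp

def pairwise_sum_cost(embs):
    """Cost tensor C[i1,...,ik] = sum over view pairs (u, v), u < v, of
    -<x^u_{i_u}, x^v_{i_v}>, for a list of k (n, d) embedding arrays.
    Note: this materializes all n^k entries."""
    k, n = len(embs), embs[0].shape[0]
    C = np.zeros((n,) * k)
    for u in range(k):
        for v in range(u + 1, k):
            shape = [1] * k
            shape[u] = shape[v] = n
            C = C + (-embs[u] @ embs[v].T).reshape(shape)
    return C

def _broadcast(f, axis, k):
    """Reshape a length-n potential so it broadcasts along one tensor axis."""
    shape = [1] * k
    shape[axis] = f.shape[0]
    return f.reshape(shape)

def mmot_cost(C, eps=0.1, iters=200):
    """Entropic multi-marginal OT cost via a multi-marginal Sinkhorn loop:
    one dual potential per marginal, each refreshed by a log-sum-exp over
    the remaining k-1 axes, with uniform marginals 1/n."""
    k, n = C.ndim, C.shape[0]
    f = [np.zeros(n) for _ in range(k)]
    for _ in range(iters):
        for u in range(k):
            S = -C / eps
            for v in range(k):
                if v != u:
                    S = S + _broadcast(f[v] / eps, v, k)
            other_axes = tuple(a for a in range(k) if a != u)
            f[u] = eps * (-np.log(n) - logsumexp(S, axis=other_axes))
    # Recover the (approximate) optimal coupling and its transport cost.
    S = -C / eps
    for v in range(k):
        S = S + _broadcast(f[v] / eps, v, k)
    P = np.exp(S)
    return (P * C).sum()

def m3g_loss(embs, eps=0.1):
    """Matching gap: average cost of the n ground-truth k-tuples (the
    diagonal of C) minus the optimal polymatching cost. It is (up to the
    entropic approximation) non-negative, and small when the ground-truth
    tuples already form an optimal matching."""
    C = pairwise_sum_cost(embs)
    n = C.shape[0]
    gt = C[tuple(np.arange(n) for _ in range(C.ndim))].sum()
    return gt / n - mmot_cost(C, eps=eps)
```

Since every Sinkhorn update touches all $n^k$ tensor entries, a sketch like this is consistent with the scaling regime quoted above ($k=3\sim 6$ views, mini-batches of $64\sim 128$), but would not extend much beyond it.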
