68
523

Testing the Manifold Hypothesis

Abstract

The hypothesis that high dimensional data tend to lie in the vicinity of a low dimensional manifold is the basis of manifold learning. The goal of this paper is to develop an algorithm (with accompanying complexity guarantees) for fitting a manifold to an unknown probability distribution supported in a separable Hilbert space, only using i.i.d samples from that distribution. More precisely, our setting is the following. Suppose that data are drawn independently at random from a probability distribution PP supported on the unit ball of a separable Hilbert space HH. Let G(d,V,τ)G(d, V, \tau) be the set of submanifolds of the unit ball of HH whose volume is at most VV and reach (which is the supremum of all rr such that any point at a distance less than rr has a unique nearest point on the manifold) is at least τ\tau. Let L(M,P)L(M, P) denote mean-squared distance of a random point from the probability distribution PP to MM. We obtain an algorithm that tests the manifold hypothesis in the following sense. The algorithm takes i.i.d random samples from PP as input, and determines which of the following two is true (at least one must be): (a) There exists MG(d,CV,τC)M \in G(d, CV, \frac{\tau}{C}) such that L(M,P)Cϵ.L(M, P) \leq C \epsilon. (b) There exists no MG(d,V/C,Cτ)M \in G(d, V/C, C\tau) such that L(M,P)ϵC.L(M, P) \leq \frac{\epsilon}{C}. The answer is correct with probability at least 1δ1-\delta.

View on arXiv
Comments on this paper