
Subspace approximation with outliers

Abstract

The subspace approximation problem with outliers, for given $n$ points in $d$ dimensions $x_1, \ldots, x_n \in R^d$, an integer $1 \leq k \leq d$, and an outlier parameter $0 \leq \alpha \leq 1$, is to find a $k$-dimensional linear subspace of $R^d$ that minimizes the sum of squared distances to its nearest $(1-\alpha)n$ points. More generally, the $\ell_p$ subspace approximation problem with outliers minimizes the sum of $p$-th powers of distances instead of the sum of squared distances. Even the case of robust PCA is non-trivial, and previous work requires additional assumptions on the input. Any multiplicative approximation algorithm for the subspace approximation problem with outliers must solve the robust subspace recovery problem, a special case in which the $(1-\alpha)n$ inliers in the optimal solution are promised to lie exactly on a $k$-dimensional linear subspace. However, robust subspace recovery is Small Set Expansion (SSE)-hard. We show how to extend dimension reduction techniques and bi-criteria approximations based on sampling to the problem of subspace approximation with outliers. To get around the SSE-hardness of robust subspace recovery, we assume that the squared distance error of the optimal $k$-dimensional subspace summed over the optimal $(1-\alpha)n$ inliers is at least $\delta$ times its squared-error summed over all $n$ points, for some $0 < \delta \leq 1-\alpha$. With this assumption, we give an efficient algorithm to find a subset of $\mathrm{poly}(k/\epsilon) \log(1/\delta) \log\log(1/\delta)$ points whose span contains a $k$-dimensional subspace that gives a multiplicative $(1+\epsilon)$-approximation to the optimal solution. The running time of our algorithm is linear in $n$ and $d$. Interestingly, our results hold even when the fraction of outliers $\alpha$ is large, as long as the obvious condition $0 < \delta \leq 1-\alpha$ is satisfied.
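As a concrete illustration of the objective (not the paper's algorithm), the following sketch evaluates the $\ell_2$ cost of a candidate $k$-dimensional subspace on the $(1-\alpha)n$ nearest points. The orthonormal basis `V`, the function name `outlier_cost`, and the SVD baseline used below are illustrative choices of ours, not constructs from the paper:

```python
import numpy as np

def outlier_cost(X, V, alpha):
    """Sum of squared distances from the (1 - alpha) * n points of X
    nearest to the subspace spanned by the orthonormal columns of V."""
    # Squared residual of each point after projecting onto span(V).
    proj = X @ V @ V.T
    residuals = np.sum((X - proj) ** 2, axis=1)
    m = int(np.ceil((1 - alpha) * len(X)))  # number of inliers kept
    # Keep only the m smallest residuals, i.e. discard the outliers.
    return float(np.sum(np.sort(residuals)[:m]))

# A simple non-robust baseline: the top-k SVD subspace of all n points.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # n = 100 points in d = 5 dimensions
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:2].T                           # d x k orthonormal basis, k = 2
print(outlier_cost(X, V, alpha=0.1))   # cost on the 90 nearest points
```

Dropping the $\alpha n$ largest residuals is what makes the objective robust; the SVD subspace shown here minimizes the cost only for $\alpha = 0$, which is exactly why the outlier version is harder.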
