
A Unified Framework for Approximating and Clustering Data

Abstract

Given a set $F$ of $n$ positive functions over a ground set $X$, we consider the problem of computing $x^*$ that minimizes the expression $\sum_{f\in F}f(x)$ over $x\in X$. A typical application is \emph{shape fitting}, where we wish to approximate a set $P$ of $n$ elements (say, points) by a shape $x$ from a (possibly infinite) family $X$ of shapes. Here, each point $p\in P$ corresponds to a function $f$ such that $f(x)$ is the distance from $p$ to $x$, and we seek a shape $x$ that minimizes the sum of distances from the points in $P$. In the $k$-clustering variant, each $x\in X$ is a tuple of $k$ shapes, and $f(x)$ is the distance from $p$ to its closest shape in $x$. Our main result is a unified framework for constructing {\em coresets} and {\em approximate clustering} for such general sets of functions. To achieve our results, we forge a link between the classic and well-defined notion of $\varepsilon$-approximations from the theory of PAC learning and VC dimension, and the relatively new (and not so consistent) paradigm of coresets, which are a kind of "compressed representation" of the input set $F$. Using traditional techniques, a coreset usually implies an LTAS (linear time approximation scheme) for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only polylogarithmic space (i.e., in the streaming model). We show how to generalize the results of our framework to squared distances (as in $k$-means), distances to the $q$th power, and deterministic constructions.
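To make the objective concrete, here is a minimal sketch (not from the paper) of the $k$-clustering instantiation: each point $p$ induces a function $f(x) = \min_{c \in x} \mathrm{dist}(p, c)$, and we minimize $\sum_{f\in F} f(x)$ over candidate tuples $x$ of $k$ centers. The brute-force search over candidate subsets is purely illustrative; the paper's contribution is precisely how to avoid such exhaustive search via coresets.

```python
import itertools
import math

def cost(points, centers):
    # The objective sum_{f in F} f(x): each point p contributes
    # f(x) = distance from p to its closest shape (here, center) in x.
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def best_k_centers(points, candidates, k):
    # Illustrative brute-force search over all k-subsets of a finite
    # candidate set; exponential in k, unlike coreset-based schemes.
    return min(itertools.combinations(candidates, k),
               key=lambda x: cost(points, x))

# Two well-separated pairs of points; the optimal 2-clustering places
# one center in each pair, for a total cost of 0.1 + 0.1 = 0.2.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers = best_k_centers(points, points, 2)
```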
