
Massively Parallel Algorithms and Hardness for Single-Linkage Clustering Under $\ell_p$-Distances

Abstract

We present massively parallel (MPC) algorithms and hardness of approximation results for computing Single-Linkage Clustering of $n$ input $d$-dimensional vectors under Hamming, $\ell_1$, $\ell_2$ and $\ell_\infty$ distances. All our algorithms run in $O(\log n)$ rounds of MPC for any fixed $d$ and achieve $(1+\epsilon)$-approximation for all distances (except Hamming, for which we show an exact algorithm). We also show constant-factor inapproximability results for $o(\log n)$-round algorithms under standard MPC hardness assumptions (for sufficiently large dimension depending on the distance used). We demonstrate the efficiency of an Apache Spark implementation of our algorithms through experiments on a variety of datasets, which show speedups of several orders of magnitude.
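For orientation, the problem being solved can be illustrated with a minimal sequential sketch. This is not the MPC algorithm of the paper: it is the textbook quadratic-time baseline, in which single-linkage clustering is obtained by merging points along the shortest pairwise distances (Kruskal-style, with union-find) until `k` clusters remain. The function names `lp_distance` and `single_linkage`, and the stopping parameter `k`, are assumptions made for this illustration only.

```python
from itertools import combinations


def lp_distance(u, v, p):
    """l_p distance between two equal-length vectors; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(u, v))
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def single_linkage(points, k, p=2):
    """Sequential baseline (illustrative only, not the paper's method).

    Merges the two closest clusters repeatedly until k clusters remain,
    and returns one cluster representative (label) per input point.
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All O(n^2) pairwise distances, sorted ascending.
    edges = sorted(
        (lp_distance(points[i], points[j], p), i, j)
        for i, j in combinations(range(n), 2)
    )

    clusters = n
    for _, i, j in edges:
        if clusters <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            # The shortest edge joining two different clusters merges them.
            parent[ri] = rj
            clusters -= 1

    return [find(i) for i in range(n)]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(single_linkage(pts, k=2, p=1))
```

The sketch materializes all pairwise distances, which is exactly the cost that becomes prohibitive at scale and motivates distributed approaches such as the one described in the abstract.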
