Massively Parallel Algorithms and Hardness for Single-Linkage Clustering Under $\ell_p$-Distances

Abstract
We present massively parallel (MPC) algorithms and hardness of approximation results for computing Single-Linkage Clustering of $n$ input $d$-dimensional vectors under Hamming, $\ell_1$, $\ell_2$ and $\ell_\infty$ distances. All our algorithms run in $O(\log n)$ rounds of MPC for any fixed $d$ and achieve $(1+\epsilon)$-approximation for all distances (except Hamming, for which we show an exact algorithm). We also show constant-factor inapproximability results for $o(\log n)$-round algorithms under standard MPC hardness assumptions (for sufficiently large dimension depending on the distance used). Efficiency of implementation of our algorithms in Apache Spark is demonstrated through experiments on a variety of datasets exhibiting speedups of several orders of magnitude.