Coded Computation over Heterogeneous Clusters
In large-scale distributed computing clusters, such as Amazon EC2, there are several types of "system noise" that can result in major degradation of performance: system failures, bottlenecks due to limited communication bandwidth, latency due to straggler nodes, etc. On the other hand, these systems enjoy abundance of computing and storage redundancy. There have been recent results that demonstrate the impact of coding for efficient utilization of computation and storage redundancy to alleviate the effect of stragglers and communication bottlenecks in homogeneous clusters. In this paper, we focus on general heterogeneous distributed computing clusters consisting of a variety of computing machines with different capabilities. We propose a coding framework for speeding up distributed computing in heterogeneous clusters by trading redundancy for reducing the latency of computation. In particular, we propose Heterogeneous Coded Matrix Multiplication (HCMM) algorithm for performing distributed matrix multiplication over heterogeneous clusters that is provably asymptotically optimal. Moreover, we show that HCMM is unboundedly faster than uncoded schemes. We also provide numerical results demonstrating significant speedups of up to 90% and 35% for HCMM in comparison to the "uncoded" and "coded homogeneous" schemes, respectively. Furthermore, we carry out real experiments over Amazon EC2 clusters, where HCMM is found to be up to 17% faster than the uncoded scheme. In our worst case experiments with artificial stragglers, HCMM provides speedups of up to 12x over the uncoded scheme. Furthermore, we provide a generalization of the problem of optimal load allocation for heterogeneous clusters to scenarios with budget constraints. In the end, we discuss about the decoding complexity and describe how LDPC codes can be combined with HCMM in order to control the complexity of decoding.
View on arXiv