Communication-avoiding Cholesky-QR2 for rectangular matrices

Abstract

Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We provide a more general parallelization of the CholeskyQR2 algorithm. This algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, achieving up to a factor of $\Theta(P^{1/6})$ less interprocessor communication than any previous parallel QR implementation. Our performance study on Intel Knights-Landing and Cray XE supercomputers demonstrates that this QR factorization method can achieve better absolute performance and parallel scalability than ScaLAPACK's QR.
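
For readers unfamiliar with CholeskyQR2, the following is a minimal sequential sketch in NumPy of the textbook formulation: form the Gram matrix, take its Cholesky factor as R, recover Q by a solve, then repeat the pass once on Q to restore orthogonality. It is not the 3D-grid parallel algorithm described in the abstract, and the function names `cholesky_qr` and `cholesky_qr2` are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np


def cholesky_qr(a):
    """One CholeskyQR pass: returns (q, r) with a ~= q @ r, r upper triangular."""
    gram = a.T @ a                      # n x n Gram matrix A^T A
    r = np.linalg.cholesky(gram).T      # NumPy returns lower L with G = L L^T; R = L^T
    # Q = A R^{-1}, computed as the solution of R^T Q^T = A^T.
    # A general solve is used here for simplicity; a triangular solve
    # would normally be preferred.
    q = np.linalg.solve(r.T, a.T).T
    return q, r


def cholesky_qr2(a):
    """Two CholeskyQR passes; the second pass improves the orthogonality of Q."""
    q1, r1 = cholesky_qr(a)
    q, r2 = cholesky_qr(q1)
    return q, r2 @ r1                   # A = Q (R2 R1)


# Usage on a tall-and-skinny, well-conditioned random matrix.
rng = np.random.default_rng(0)
a = rng.standard_normal((200, 10))
q, r = cholesky_qr2(a)
print(np.linalg.norm(q.T @ q - np.eye(10)))  # orthogonality error
print(np.linalg.norm(q @ r - a))             # factorization residual
```

The sketch assumes a full-column-rank, reasonably well-conditioned input; the Cholesky factorization of the Gram matrix can fail for ill-conditioned matrices.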
