
Communication-avoiding Cholesky-QR2 for rectangular matrices

Abstract

Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade off the costs of synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, yielding a code that can achieve a factor of $\Theta(P^{1/6})$ less interprocessor communication on $P$ processors than any previous parallel QR implementation. Our performance study on Intel Knights-Landing and Cray XE supercomputers demonstrates the effectiveness of this CholeskyQR2 parallelization on a large number of nodes. Specifically, relative to ScaLAPACK's QR, on 1024 nodes of Stampede2, our CholeskyQR2 implementation is faster by 2.6x-3.3x in strong scaling tests and by 1.1x-1.9x in weak scaling tests.
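For readers unfamiliar with the underlying method, the following is a minimal sequential sketch of CholeskyQR2 in NumPy. It is not the 3D-grid parallel algorithm described in the abstract; it only illustrates the textbook two-pass scheme (form the Gram matrix, take its Cholesky factor, then repeat on the resulting Q to restore orthogonality). The function names and the test matrix are illustrative assumptions, and a general dense solve stands in for a dedicated triangular solve.

```python
import numpy as np

def cholesky_qr(a):
    """One CholeskyQR pass: A = Q R, with R taken from the Cholesky factor of A^T A."""
    gram = a.T @ a                       # n x n Gram matrix A^T A
    r = np.linalg.cholesky(gram).T       # upper-triangular R with R^T R = A^T A
    # Q = A R^{-1}, computed by solving R^T Q^T = A^T.
    # (A general solver is used here for brevity; a triangular solve would be cheaper.)
    q = np.linalg.solve(r.T, a.T).T
    return q, r

def cholesky_qr2(a):
    """Two CholeskyQR passes; the second pass improves the orthogonality of Q."""
    q1, r1 = cholesky_qr(a)
    q, r2 = cholesky_qr(q1)
    return q, r2 @ r1                    # A = Q (R2 R1)

# Illustrative tall-and-skinny input (assumed sizes, well conditioned).
rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 8))
q, r = cholesky_qr2(a)
```

Because the Gram matrix squares the condition number of the input, this sketch is only appropriate for reasonably well-conditioned matrices; the Cholesky step can fail outright on ill-conditioned input.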
