Large-Scale Discrete Fourier Transform on TPUs
In this work, we present a parallel algorithm for the large-scale discrete Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The algorithm is implemented in TensorFlow because of its rich set of functionalities for scientific computing and its simplicity in realizing parallel computing algorithms. The DFT formulation is based on matrix multiplications between the input data and the Vandermonde matrix. This formulation takes full advantage of the TPU's strength in matrix multiplication and allows nonuniformly sampled input data without modifying the implementation. For parallel computing, both the input data and the Vandermonde matrix are partitioned and distributed across TPU cores. Through this data decomposition, the matrix multiplications are kept local within TPU cores and can be performed completely in parallel. The communication among TPU cores is achieved through the one-shuffle scheme, in which data are sent and received simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme is designed for the interconnect topology of TPU clusters and requires minimal communication time among TPU cores. Numerical examples demonstrate the high parallel efficiency of the large-scale DFT on TPUs.
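To make the formulation concrete, the following is a minimal single-process sketch in plain Python, not the authors' TensorFlow implementation. It assumes uniformly sampled input, so that the Vandermonde matrix has entries exp(-2*pi*i*k*j/N), and it models the one-shuffle scheme as a ring in which every simulated core passes its input block to the next neighbor in the same direction. The function names (`vandermonde_block`, `ring_dft`, `direct_dft`) and the ring layout are illustrative assumptions. For brevity, each Vandermonde block is generated on the fly rather than stored on the core.

```python
import cmath


def vandermonde_block(rows, cols, n):
    """Block of the N-point DFT matrix: W[k][j] = exp(-2*pi*i*k*j/N)."""
    return [[cmath.exp(-2j * cmath.pi * k * j / n) for j in cols] for k in rows]


def ring_dft(x, num_cores):
    """Simulate a block-partitioned DFT with a ring-style data shuffle.

    Assumption: core p owns output block p and starts with input block p.
    """
    n = len(x)
    assert n % num_cores == 0, "input length must divide evenly across cores"
    b = n // num_cores  # block size per simulated core

    # held[p] = (index of the input block currently on core p, the block itself)
    held = [(p, x[p * b:(p + 1) * b]) for p in range(num_cores)]
    out = [[0j] * b for _ in range(num_cores)]

    for _ in range(num_cores):
        # Local step: each core multiplies one Vandermonde block by the input
        # block it currently holds. No cross-core data is needed here.
        for p in range(num_cores):
            q, xb = held[p]
            w = vandermonde_block(range(p * b, (p + 1) * b),
                                  range(q * b, (q + 1) * b), n)
            for r in range(b):
                out[p][r] += sum(w[r][c] * xb[c] for c in range(b))
        # Shuffle step: every core sends its block to the next core and
        # receives from the previous one, all in the same direction.
        held = [held[(p - 1) % num_cores] for p in range(num_cores)]

    return [v for block in out for v in block]


def direct_dft(x):
    """Reference O(N^2) DFT used only to check the partitioned version."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]
```

Calling `ring_dft(x, 4)` on an 8-sample input should agree with `direct_dft(x)` up to floating-point rounding. Supporting nonuniformly sampled data in this sketch would only require changing how the matrix entries are generated, which mirrors the point made in the abstract.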