A Parallel Scan Algorithm in the Tensor Core Unit Model

Abstract
We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size is a basic operation. In the -TCU model, we show that for inputs of size , the algorithm has depth at most and runs in time assuming tensor core units. Equivalently, the algorithm performs multiplications of square matrices of size s.
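To make the basic idea concrete, the following is a minimal sketch of how a prefix sum over a single block of s*s values can be expressed with s x s matrix products, using triangular matrices of ones. It is an illustration only and is not claimed to be the algorithm of the paper. The names `matmul` and `block_scan` are hypothetical, and the pure-Python `matmul` merely stands in for a tensor core multiplication.

```python
from itertools import accumulate


def matmul(a, b):
    """Multiply two s x s matrices given as lists of rows.

    Assumption: this stands in for one tensor core multiplication.
    """
    s = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(s)) for j in range(s)]
            for i in range(s)]


def block_scan(x, s):
    """Inclusive prefix sum of exactly s*s values via three s x s products."""
    assert len(x) == s * s
    # Lay the input out row by row as an s x s matrix A.
    A = [x[i * s:(i + 1) * s] for i in range(s)]
    # U: upper triangular ones; L: strictly lower triangular ones.
    U = [[1 if i <= j else 0 for j in range(s)] for i in range(s)]
    L = [[1 if j < i else 0 for j in range(s)] for i in range(s)]
    ones = [[1] * s for _ in range(s)]
    # (A U)[i][j] is the prefix sum within row i up to column j.
    row_scans = matmul(A, U)
    # (A ones)[i][j] is the total of row i, so (L A ones)[i][j] is the
    # total of all rows strictly above row i.
    offsets = matmul(L, matmul(A, ones))
    return [row_scans[i][j] + offsets[i][j]
            for i in range(s) for j in range(s)]


if __name__ == "__main__":
    data = list(range(1, 17))
    # Compare against the standard library's sequential prefix sum.
    print(block_scan(data, 4) == list(accumulate(data)))
```

Inputs longer than one block would need the block totals to be combined across blocks, which this sketch does not attempt.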