Matrix Factorization using Distributed Panels on the Fujitsu AP1000

P.E. Strazdins. Matrix Factorization using Distributed Panels on the Fujitsu AP1000 . IEEE First International Conference on Algorithms And Architectures for Parallel Processing (ICA3PP-95), pp263-73, Brisbane, April 1995.

IEEE Explore entry
paper for ICA3PP-95 (11 pages, 624KB)
slides for ICA3PP-95 (13 pages, 112 KB)

Abstract

Dense linear algebra computations such as matrix factorization require the technique of `block-partitioned algorithms' for their efficient implementation on memory-hierarchy processors. For scalar-based distributed memory multiprocessors, the register, cache and off-processor memory levels of the memory hierarchy all affect the optimal block-partition size for such algorithms. Most studies on matrix factorization and similar algorithms have assumed that the block-partition size or panel width for the algorithm, w, to be the same as the matrix distribution block size, r, where a rectangular block-cyclic matrix distribution is being employed. Here the choice of w=r is essentially determined by the off-processor memory level of the memory hierarchy, with the value of w being a tradeoff between communication startup overhead and load balance considerations.

In this paper, we re-examine this assumption in the context of LU and Cholesky factorization of block-cyclic distributed matrices on scalar-based distributed memory multiprocessors, such as the Fujitsu AP1000. Here considerations of the register and cache levels of the hierarchy require a large w. We find that the choice of w, given w=r, leads to a tradeoff between load balance and optimal use of register and cache levels of the hierarchy (rather than communication startup), and that this tradeoff substantially limits performance. We then briefly describe `distributed panels' versions of these algorithms, where generally w > r, which effectively diminishes this tradeoff to an O(w/N) fraction of the overall computation, where N is the matrix size. Two variants of these versions, one with single rows/columns being communicated, and one with single block rows/columns being communicated, are analyzed for their load balance properties.

The results of the distributed panels versions of the algorithms on the scalar-based distributed memory multiprocessor the Fujitsu AP1000 are given, which give significantly superior performance for distributed panels versions over the w = r versions, with optimum performance achieved for r ~ 1.

Matrix Factorization using Distributed Panels on the Fujitsu AP1000

Contents

Abstract