Matrix Factorization using Distributed Panels
on the Fujitsu AP1000
P.E. Strazdins.
Matrix Factorization using Distributed Panels on the
Fujitsu AP1000 .
IEEE First International Conference on
Algorithms And Architectures for Parallel Processing (ICA3PP-95), pp263-73,
Brisbane, April 1995.
Contents
Abstract
Dense linear algebra computations such as matrix factorization
require the technique of `block-partitioned algorithms' for
their efficient implementation on memory-hierarchy processors.
For scalar-based distributed memory multiprocessors, the register, cache and
off-processor memory levels of the memory hierarchy
all affect the optimal block-partition size for such algorithms.
Most studies on matrix factorization and similar algorithms have
assumed that the block-partition size or panel width for the algorithm,
w, to be the same as the matrix distribution block size, r,
where a rectangular block-cyclic matrix distribution is being employed.
Here the choice of w=r is essentially determined
by the off-processor memory level of the memory hierarchy,
with the value of w being a tradeoff between communication startup
overhead and load balance considerations.
In this paper, we re-examine this assumption in the context of
LU
and Cholesky
factorization of block-cyclic distributed matrices
on scalar-based distributed memory multiprocessors, such as the Fujitsu AP1000.
Here considerations of
the register and cache levels of the hierarchy require a large w.
We find that the choice of w, given w=r, leads to a tradeoff
between load balance and optimal use of register and cache levels of the hierarchy
(rather than communication startup), and that this tradeoff
substantially limits performance.
We then briefly describe `distributed panels' versions of these algorithms,
where generally w > r, which effectively diminishes this tradeoff
to an O(w/N) fraction of the overall computation,
where N is the matrix size.
Two variants of these versions, one with single rows/columns being
communicated, and one with single block rows/columns being communicated,
are analyzed for their load balance properties.
The results of the distributed panels versions of the algorithms
on the scalar-based distributed memory multiprocessor the Fujitsu AP1000 are given,
which give significantly superior performance for distributed panels
versions over the w = r versions, with optimum performance
achieved for r ~ 1.