Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS
P.E. Strazdins.
Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS.
Proceedings of the Third Parallel Computing Workshop (PCW'94),
Fujitsu Parallel Computing Research Center,
Kawasaki, November 1994, pp P1-R-1 -- P1-R-7.
Abstract
Given an implementation of Distributed BLAS Level 3
kernels, the parallelization of dense linear algebra
libraries such as LAPACK can be easily achieved.
In this paper, we briefly describe the implementation
and performance on the AP1000 of Distributed BLAS Level 3
for the rectangular r x s block-cyclic matrix distribution.
Then, the parallelization of the central matrix factorization
and the tridiagonal reduction routines from LAPACK
are described, where the algorithmic `blocking factor' w
can be independent of the matrix distribution block size r.
For scalar-based MIMD parallel processors
with relatively low communication startup costs,
such as the AP1000, it is found that the optimal
r and w generally satisfy w >> r with r ~ 1,
differing from results published for vector-based parallel processors.
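
For readers unfamiliar with the r x s block-cyclic distribution mentioned above, the following is a minimal sketch of the usual index mapping and is not taken from the paper's implementation. It assumes 0-based global indices, blocks of r rows by s columns, and a hypothetical P x Q process grid; the function names and the grid parameters P and Q are illustrative assumptions.

```python
def block_cyclic_owner(i, j, r, s, P, Q):
    """Return the (p, q) grid coordinates of the process that owns
    global element (i, j) under an r x s block-cyclic distribution
    over an assumed P x Q process grid (all indices 0-based)."""
    # Block row i // r is dealt out cyclically over the P process rows,
    # and block column j // s over the Q process columns.
    return (i // r) % P, (j // s) % Q


def local_index(g, block, nprocs):
    """Map a global row (or column) index g to its local index on the
    owning process, for the given block size and number of processes
    along that dimension."""
    b = g // block  # global block number containing g
    # The owner holds every nprocs-th block, so b // nprocs full blocks
    # precede this one locally; g % block is the offset inside the block.
    return (b // nprocs) * block + g % block
```

For example, with r = s = 2 on a 2 x 2 grid, element (5, 3) belongs to process (0, 1), and global row 5 is local row 3 there. With r = 1 the row mapping reduces to the purely cyclic case, where row i is owned by process row i mod P.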