Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS

P.E. Strazdins. Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS. Proceedings of the Third Parallel Computing Workshop (PCW'94), Fujitsu Parallel Computing Research Center, Kawasaki, November 1994, pp. P1-R-1 -- P1-R-7.

paper for PCW'94 (7 pages, 76KB)

Abstract

Given an implementation of Distributed BLAS Level 3 kernels, dense linear algebra libraries such as LAPACK can be parallelized easily. In this paper, we briefly describe the implementation and performance on the AP1000 of Distributed BLAS Level 3 for the rectangular r x s block-cyclic matrix distribution. We then describe the parallelization of the central matrix factorization routines and the tridiagonal reduction routine from LAPACK, in which the algorithmic 'blocking factor' w can be chosen independently of the matrix distribution block size r. For scalar-based MIMD parallel processors with relatively low communication startup costs, such as the AP1000, we find that the optimum r and w generally satisfy w >> r with r ~ 1. This differs from results published for vector-based parallel processors.
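To make the r x s block-cyclic distribution and the separation of w from r concrete, here is a minimal sketch. It assumes 0-based indices, a P x Q process grid, and the conventional block-cyclic mapping (as used, for example, by ScaLAPACK). The function names are hypothetical and do not represent the paper's Distributed BLAS interface. The sketch maps a global index to its owning process and local index. It also lists which process columns hold a panel of w consecutive columns, which shows why w >> r with r ~ 1 spreads each panel across the whole process row.

```python
def owner_and_local(i, r, P):
    """Map global index i to (process coordinate, local index) under a
    block-cyclic distribution with block size r over P processes."""
    block = i // r              # which r-sized block i falls in
    p = block % P               # blocks are dealt out cyclically
    local = (block // P) * r + i % r   # position within p's local storage
    return p, local


def owner_2d(i, j, r, s, P, Q):
    """Owner and local coordinates of element (i, j) under an r x s
    block-cyclic distribution over a P x Q process grid."""
    pi, li = owner_and_local(i, r, P)
    pj, lj = owner_and_local(j, s, Q)
    return (pi, pj), (li, lj)


def panel_process_columns(j0, w, s, Q):
    """Process columns holding columns j0 .. j0+w-1 of a width-w panel
    (w is the algorithmic blocking factor, independent of s)."""
    return sorted({(j // s) % Q for j in range(j0, j0 + w)})


if __name__ == "__main__":
    # With s = 1 a width-8 panel is spread over all 4 process columns;
    # with s = 8 the same panel sits on a single process column.
    print(panel_process_columns(0, 8, 1, 4))
    print(panel_process_columns(0, 8, 8, 4))
```

With s = 1 the first call returns [0, 1, 2, 3], and with s = 8 the second returns [0]. Choosing a small distribution block therefore lets every process column take part in work on each panel, while w alone sets the granularity of the blocked algorithm.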

