A High Performance, Portable Distributed BLAS Implementation
P.E. Strazdins, "A High Performance, Portable Distributed BLAS Implementation", Proceedings of the Sixth Parallel Computing Workshop, Fujitsu Parallel Computing Research Center, Kawasaki, November 1996, pages P2-K-1 -- P2-K-10.
Abstract
In this paper, we report on recent developments in the
Distributed BLAS (DBLAS) project. These include a powerful distributed
matrix representation, which yields a simple interface to the DBLAS, and
the redesign of the DBLAS algorithms in terms of powerful `spread' and
`reduce' matrix communication operations, for reasons of programmability.
The DBLAS codes achieve portability by supporting the BLACS and various
forms of ApLib, including a locally developed `stride' ApLib (for the
Fujitsu AP1000/AP+), which is optimal for the `spread' and `reduce'
operations. Ensuring high-speed cell computation across various cell
architectures also involved designing a cell BLAS interface and
expressing the DBLAS algorithms in terms of a set of tunable
architecture-dependent parameters. Cell BLAS algorithms have also been
extended to SPARC 10 platforms, where deeper software pipelining was
required for optimal performance. More extreme techniques were required
for the UltraSPARC, with a preliminary matrix multiply kernel achieving
250-300 MFLOPs.
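The `spread' and `reduce' operations mentioned above can be illustrated with a serial sketch. All names here (`spread_row`, `reduce_row`, the cell labels) are hypothetical and stand in for whatever communication layer (BLACS or ApLib) is used; a real implementation broadcasts and combines panels across a logical process grid rather than copying dictionaries.

```python
# Serial sketch of `spread' and `reduce' matrix communication operations
# over one row of a logical cell (process) grid.  Illustrative only:
# these are not the actual DBLAS API names.

def spread_row(grid_row, panel):
    """Spread (broadcast) a matrix panel to every cell in a grid row:
    afterwards each cell holds its own copy of the panel."""
    return {cell: [row[:] for row in panel] for cell in grid_row}

def reduce_row(grid_row, panels, op=lambda a, b: a + b):
    """Reduce (element-wise combine) one panel per cell into a single
    panel, e.g. summing partial products of a distributed multiply."""
    cells = list(grid_row)
    result = [row[:] for row in panels[cells[0]]]
    for cell in cells[1:]:
        p = panels[cell]
        for i in range(len(result)):
            for j in range(len(result[i])):
                result[i][j] = op(result[i][j], p[i][j])
    return result

grid_row = ["cell0", "cell1", "cell2"]
copies = spread_row(grid_row, [[1.0, 2.0], [3.0, 4.0]])
partials = {c: [[1.0, 1.0], [1.0, 1.0]] for c in grid_row}
total = reduce_row(grid_row, partials)  # each entry sums across 3 cells
```

Expressing the distributed algorithms purely in terms of these two collective operations is what keeps the DBLAS code simple and portable across communication libraries.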
The DBLAS has been used to produce elegant parallel LU, Cholesky and QR
factorization algorithms using the `distributed panels' technique, and
we report results for the QR factorization on the AP1000 and for the LU
factorization on the AP+. This technique has yielded good performance,
being up to 15% faster than traditional algorithms for moderate matrix
sizes. Remaining performance problems include O(N) software startup
overheads, which have a significant impact on dense linear algebra
algorithms on N x N matrices.
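The panel structure underlying these factorizations can be seen in a serial sketch of blocked, right-looking LU without pivoting: factor a narrow column panel, solve for the corresponding row panel, then update the trailing submatrix. The function name and block size below are illustrative assumptions, not the paper's code; in the parallel algorithm the panel is spread across the cell grid and the trailing update becomes a distributed matrix multiply.

```python
# Serial sketch of panel-based (blocked, right-looking) LU factorization
# without pivoting.  Names and block size are illustrative only.

def lu_panels(A, nb=2):
    """In-place LU of a square matrix A (list of lists): afterwards the
    strict lower triangle holds L (unit diagonal), the upper holds U."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Factor the current column panel A[k:, k:k+kb] (unblocked LU).
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, k + kb):
                    A[i][c] -= A[i][j] * A[j][c]
        # 2. Triangular solve for the row panel A[k:k+kb, k+kb:]
        #    against the unit lower-triangular panel factor.
        for j in range(k + kb, n):
            for i in range(k, k + kb):
                for r in range(k, i):
                    A[i][j] -= A[i][r] * A[r][j]
        # 3. Rank-kb update of the trailing submatrix -- the bulk of the
        #    work, performed by matrix multiply in the parallel algorithm.
        for i in range(k + kb, n):
            for j in range(k + kb, n):
                for r in range(k, k + kb):
                    A[i][j] -= A[i][r] * A[r][j]
    return A
```

The O(N) startup overheads noted above arise because each of the N/nb panel steps pays a fixed communication startup cost, which dominates for moderate N.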