A High Performance, Portable Distributed BLAS Implementation
P.E. Strazdins and H. Koesmarno,
"A High Performance Version of Parallel LAPACK: Preliminary Report",
Proceedings of the Sixth Parallel Computing Workshop,
Fujitsu Parallel Computing Research Center,
Kawasaki, November 1996, pages P2-J-1 -- P2-J-8.
Abstract
Dense linear algebra computations require the technique of
`block-partitioned algorithms' for their efficient implementation on
memory-hierarchy multi-processors. Most existing studies and libraries
for this purpose, for example ScaLAPACK, assume that the block or panel
width omega for these algorithms must be the same as the matrix
distribution block size r. We present a project in progress to extend
ScaLAPACK using the `distributed panels' technique, i.e. allowing omega
> r, which has the twofold advantage of improving performance on
memory-hierarchy multiprocessors and yielding a simplified user
interface. A key element of the project is a general Distributed BLAS
implementation, which has been developed primarily for the Fujitsu AP
series of multiprocessors but is now fully portable. Other key
components are versions of the BLAS and BLACS libraries to achieve
high-performance cell computation and communication, respectively, on
the required target multiprocessor architectures.
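The core idea of a block-partitioned algorithm with a panel width omega decoupled from the distribution block size r can be illustrated with a sequential sketch. The following is a minimal, unpivoted right-looking blocked LU factorization in NumPy, written for illustration only; it is not the project's code, and the function name and the choice of LU are assumptions. The point is that omega is a free tuning parameter of the algorithm, not tied to how the matrix is stored or distributed:

```python
import numpy as np

def blocked_lu(A, omega):
    """Right-looking block-partitioned LU (no pivoting).

    omega is the panel width, a pure performance parameter: it controls
    how much work is deferred into the BLAS-3 trailing update, and is
    independent of any storage/distribution block size.
    Illustrative sketch only -- assumes A is safely factorizable
    without pivoting (e.g. diagonally dominant).
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, omega):
        w = min(omega, n - k)
        # 1. Factor the current (n-k) x w panel column by column.
        for j in range(k, k + w):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + w] -= np.outer(A[j + 1:, j],
                                               A[j, j + 1:k + w])
        # 2. Triangular solve: compute the U block to the right of the panel.
        L11 = np.tril(A[k:k + w, k:k + w], -1) + np.eye(w)
        A[k:k + w, k + w:] = np.linalg.solve(L11, A[k:k + w, k + w:])
        # 3. Rank-omega update of the trailing submatrix -- the BLAS-3
        #    (matrix-matrix) kernel whose efficiency grows with omega.
        A[k + w:, k + w:] -= A[k + w:, k:k + w] @ A[k:k + w, k + w:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U
```

In a distributed setting with block-cyclic distribution of block size r, a panel of width omega > r spans several distribution blocks, which is what the `distributed panels' technique has to manage.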
Preliminary experiences and results using the Fujitsu AP1000
multiprocessor indicate that good performance improvements are possible
for relatively little effort. Performance models indicate similar
improvements can be expected on multiprocessors with relatively low
communication costs and large (second-level) caches. Future work in the
project includes improving the DBLAS to `cache' previously communicated
data, and porting and testing the codes on other multiprocessor
platforms.