A High Performance, Portable Distributed BLAS Implementation

P.E. Strazdins and H. Koesmarno, A High Performance Version of Parallel LAPACK: Preliminary Report, Proceedings of the Sixth Parallel Computing Workshop, Fujitsu Parallel Computing Research Center, Kawasaki, November 1996, pages P2-J-1--P2-J-8.

Contents

Abstract

Dense linear algebra computations require the technique of `block-partitioned algorithms' for their efficient implementation on memory-hierarchy multiprocessors. Most existing studies and libraries for this purpose, for example ScaLAPACK, assume that the block or panel width omega for these algorithms must be the same as the matrix distribution block size r. We present a project in progress to extend ScaLAPACK using the `distributed panels' technique, i.e. to allow omega > r, which has the twofold advantage of improving performance on memory-hierarchy multiprocessors and yielding a simplified user interface. A key element of the project is a general Distributed BLAS implementation, developed primarily for the Fujitsu AP series of multiprocessors but now fully portable. Other key components are versions of the BLAS and BLACS libraries to achieve high-performance cell computation and communication, respectively, on the required target multiprocessor architectures.
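The distinction between omega and r can be sketched as follows. Under a block-cyclic distribution with block size r over P processes, a panel of width omega = r always lies within a single distribution block and hence on one process, whereas omega > r spreads the panel across several processes, which is the `distributed panels' situation. This is a minimal illustrative sketch, not the ScaLAPACK or DBLAS API; the function names and the one-dimensional process grid are assumptions for illustration only.

```python
def owner(j, r, P):
    """Process that owns global column j under a 1-D block-cyclic
    distribution with block size r over P processes (hypothetical
    helper, not a ScaLAPACK routine)."""
    return (j // r) % P

def panel_owners(j0, omega, r, P):
    """Set of processes touched by a panel of width omega starting
    at global column j0."""
    return sorted({owner(j, r, P) for j in range(j0, j0 + omega)})

# With omega == r, the panel has a single owner; with omega > r
# ("distributed panels"), it spans several processes.
print(panel_owners(0, 4, 4, 4))   # omega == r:  [0]
print(panel_owners(0, 8, 4, 4))   # omega == 2r: [0, 1]
```

With omega > r, the panel factorization itself becomes a distributed computation, which is why a general Distributed BLAS is needed to support it.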

Preliminary experiences and results on the Fujitsu AP1000 multiprocessor indicate that good performance improvements are possible for relatively little effort. Performance models indicate that similar improvements can be expected on multiprocessors with relatively low communication costs and large (second-level) caches. Future work in the project includes improving the DBLAS to `cache' previously communicated data, and porting and testing the codes on other multiprocessor platforms.