A High Performance, Portable Distributed BLAS Implementation
P.E. Strazdins, "A High Performance, Portable Distributed BLAS Implementation", Proceedings of the Sixth Parallel Computing Workshop, Fujitsu Parallel Computing Research Center, Kawasaki, November 1996, pages P2-K-1 -- P2-K-10.
Abstract
In this paper, we report on recent developments in the
Distributed BLAS (DBLAS) project. These include a powerful distributed
matrix representation, which yields a simple interface to the DBLAS, and
the redesign of the DBLAS algorithms in terms of powerful `spread' and
`reduce' matrix communication operations, for reasons of programmability.
The DBLAS codes achieve portability by supporting the BLACS and various
forms of ApLib, including a locally developed `stride' ApLib (for the
Fujitsu AP1000/AP+), which is optimal for the `spread' and `reduce'
operations. Ensuring high-speed cell computation across various cell
architectures also involved designing a cell BLAS interface and
expressing the DBLAS algorithms in terms of a set of tunable
architecture-dependent parameters. Cell BLAS algorithms have also been
extended to SPARC 10 platforms, where deeper software pipelining was
required for optimal performance. More extreme techniques were required
for the UltraSPARC, with a preliminary matrix multiply kernel achieving
250-300 MFLOPs.
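The `spread' and `reduce' operations mentioned above can be illustrated with a serial sketch. All names here (`spread_row`, `reduce_row`, the cell labels) are hypothetical and stand in for whatever communication layer (BLACS or ApLib) is used; a real implementation broadcasts and combines panels across a logical process grid rather than copying dictionaries.

```python
# Serial sketch of `spread' and `reduce' matrix communication operations
# over one row of a logical cell (process) grid.  Illustrative only:
# these are not the actual DBLAS API names.

def spread_row(grid_row, panel):
    """Spread (broadcast) a matrix panel to every cell in a grid row:
    afterwards each cell holds its own copy of the panel."""
    return {cell: [row[:] for row in panel] for cell in grid_row}

def reduce_row(grid_row, panels, op=lambda a, b: a + b):
    """Reduce (element-wise combine) one panel per cell into a single
    panel, e.g. summing partial products of a distributed multiply."""
    cells = list(grid_row)
    result = [row[:] for row in panels[cells[0]]]
    for cell in cells[1:]:
        p = panels[cell]
        for i in range(len(result)):
            for j in range(len(result[i])):
                result[i][j] = op(result[i][j], p[i][j])
    return result

grid_row = ["cell0", "cell1", "cell2"]
copies = spread_row(grid_row, [[1.0, 2.0], [3.0, 4.0]])
partials = {c: [[1.0, 1.0], [1.0, 1.0]] for c in grid_row}
total = reduce_row(grid_row, partials)  # each entry sums across 3 cells
```

Expressing the distributed algorithms purely in terms of these two collective operations is what keeps the DBLAS code simple and portable across communication libraries.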
The DBLAS has been used to produce elegant parallel LU, Cholesky and QR
factorization algorithms using the `distributed panels' technique, and
we report results for the QR factorization on the AP1000 and for the LU
factorization on the AP+. This technique has yielded good performance,
being up to 15% faster than traditional algorithms for moderate matrix
sizes. Remaining performance problems include O(N) software startup
overheads, which have a significant impact on dense linear algebra
algorithms on N x N matrices.
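The panel structure underlying these factorizations can be seen in a serial sketch of blocked, right-looking LU without pivoting: factor a narrow column panel, solve for the corresponding row panel, then update the trailing submatrix. The function name and block size below are illustrative assumptions, not the paper's code; in the parallel algorithm the panel is spread across the cell grid and the trailing update becomes a distributed matrix multiply.

```python
# Serial sketch of panel-based (blocked, right-looking) LU factorization
# without pivoting.  Names and block size are illustrative only.

def lu_panels(A, nb=2):
    """In-place LU of a square matrix A (list of lists): afterwards the
    strict lower triangle holds L (unit diagonal), the upper holds U."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Factor the current column panel A[k:, k:k+kb] (unblocked LU).
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, k + kb):
                    A[i][c] -= A[i][j] * A[j][c]
        # 2. Triangular solve for the row panel A[k:k+kb, k+kb:]
        #    against the unit lower-triangular panel factor.
        for j in range(k + kb, n):
            for i in range(k, k + kb):
                for r in range(k, i):
                    A[i][j] -= A[i][r] * A[r][j]
        # 3. Rank-kb update of the trailing submatrix -- the bulk of the
        #    work, performed by matrix multiply in the parallel algorithm.
        for i in range(k + kb, n):
            for j in range(k + kb, n):
                for r in range(k, k + kb):
                    A[i][j] -= A[i][r] * A[r][j]
    return A
```

The O(N) startup overheads noted above arise because each of the N/nb panel steps pays a fixed communication startup cost, which dominates for moderate N.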