The Implementation of BLAS level 3 on the AP1000 - Preliminary Report
P.E. Strazdins and R.P. Brent,
Implementation of BLAS level 3 on the Fujitsu AP1000: Preliminary Report,
Second ANU-Fujitsu CAP Workshop,
Australian National University, November 1991
Abstract
The Basic Linear Algebra Subprograms (BLAS) library
is widely used in many supercomputing applications,
and is used to implement more extensive linear algebra subroutine libraries,
such as LINPACK and LAPACK.
To take advantage of the high degree of parallelism of architectures such as
the Fujitsu AP1000, BLAS level 3 routines (matrix-matrix operations)
are proposed.
This project is concerned with implementing BLAS level 3 (BLAS-3) for single-precision matrices on the
AP1000, with emphasis on obtaining the highest possible performance,
without significantly sacrificing numerical stability.
This paper discusses the techniques used to achieve this goal,
together with the underlying issues.
The most important techniques were the use of software pipelining and
loop unrolling for writing optimized assembler inner loops for
matrix inner and outer products, which were able to operate at more than 90%
and 70%, respectively, of the AP1000's theoretical peak performance.
The efficiency of cell communication using wormhole routing on the AP1000,
especially the row/column broadcast, enabled a sustained performance of
80 to 90% of the theoretical peak for all the BLAS-3 routines.
It also meant that many variations (using different communication schemes)
for matrix multiplication have more or less equivalent performance.
However, for future versions of the AP1000, optimizing communication must still
be considered.
Techniques for improving the performance for large matrices
(partitioning, to improve cache utilization) and for small matrices
(minimizing communication) are employed.
The latter technique has been developed for general rectangular AP1000 configurations.
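
As a rough illustration of two of the techniques named above (loop unrolling of the inner product, and partitioning to improve cache utilization), the following Python sketch may help. It is not the authors' code: the paper's inner loops were written in optimized assembler for the AP1000, whereas this sketch only shows the loop structure on a single processor. The function names dot_unrolled and matmul_blocked, the unrolling factor of 4, and the block parameter are all assumptions made for illustration.

```python
def dot_unrolled(x, y):
    """Inner product with the loop unrolled by 4 (hypothetical helper).

    Four independent accumulators are used so that, in a compiled or
    assembler version, successive multiply-adds need not wait on each other.
    """
    n = len(x)
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
        i += 4
    while i < n:  # clean-up loop for the remaining 0 to 3 elements
        s0 += x[i] * y[i]
        i += 1
    return (s0 + s1) + (s2 + s3)


def matmul_blocked(a, b, block=2):
    """C = A * B computed one block at a time (hypothetical helper).

    Each (i0, j0, k0) step touches only a block-sized piece of A, B and C,
    which is the idea behind partitioning for better cache utilization.
    The block size here is an illustrative assumption, not a tuned value.
    """
    m, k, n = len(a), len(b), len(b[0])
    bt = [list(col) for col in zip(*b)]  # transpose B so its columns are rows
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for k0 in range(0, k, block):
                k1 = min(k0 + block, k)
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        c[i][j] += dot_unrolled(a[i][k0:k1], bt[j][k0:k1])
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(matmul_blocked(a, b))
```

Note that the separate accumulators change the order in which terms are summed, so in floating-point arithmetic the result can differ slightly from a simple left-to-right sum; this is one place where the trade-off between performance and numerical behaviour mentioned in the abstract shows up.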