Software Overhead and Blocking Issues in Parallel BLAS
P.E. Strazdins,
presented to the ScaLAPACK Conference call,
Computer Science Department, University of Tennessee, Knoxville,
March 17, 1998
Abstract
This paper compares the performance of three parallel BLAS
implementations on the Fujitsu AP1000 and AP+ parallel computers.
The comparison
is based on LU and LLT decomposition benchmarks, with a secondary parallel
rank-1 update benchmark to illustrate the extent of software overheads.
These benchmarks show that, on the AP1000, algorithmic blocking and reduced
software overheads can enhance the performance of the two decompositions
by a factor of 1.70 (LU) and 1.25 (LLT) for small matrices, and by a factor
of 1.27 (LU) and 1.23 (LLT) for moderate-sized matrices.
On the AP+, the enhancement was a factor of 1.52 (LU)
and 1.20 (LLT) for small matrices, and a factor of 1.27 (LU) and 1.31
(LLT) for moderate-sized matrices.
For large matrices, the performance difference persists for LU
and is reduced by half for LLT.
For these reasons, it is important for parallel BLAS implementations to
have low software overheads and support algorithmic blocking effectively.
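
To make the link between blocking and per-call software overhead concrete, the
following is a minimal serial sketch in Python. It is not the parallel BLAS code
evaluated in this paper. It assumes LU without pivoting on a small dense matrix,
and all names (rank1_update, lu_rank1, lu_blocked, the panel width w) are
hypothetical, chosen only for illustration. It counts how many trailing-matrix
update calls each variant issues, since every such call would carry a fixed
software overhead in a parallel BLAS setting.

```python
def rank1_update(A, x, y, r0, c0):
    """Rank-1 update of a trailing submatrix: A[r0+i][c0+j] -= x[i] * y[j]."""
    for i, xi in enumerate(x):
        row = A[r0 + i]
        for j, yj in enumerate(y):
            row[c0 + j] -= xi * yj


def lu_rank1(A):
    """Unblocked in-place LU (no pivoting); returns the number of update calls."""
    n = len(A)
    calls = 0
    for k in range(n - 1):
        pivot = A[k][k]
        for i in range(k + 1, n):
            A[i][k] /= pivot                      # column of L
        x = [A[i][k] for i in range(k + 1, n)]
        y = A[k][k + 1:]                          # row of U
        rank1_update(A, x, y, k + 1, k + 1)       # one call per column
        calls += 1
    return calls


def lu_blocked(A, w):
    """Blocked in-place LU (no pivoting) with panel width w.

    Returns the number of trailing-matrix (rank-w) update calls.
    """
    n = len(A)
    calls = 0
    for k0 in range(0, n, w):
        k1 = min(k0 + w, n)
        # 1. Factor the panel (columns k0..k1-1), updating panel columns only.
        for k in range(k0, k1):
            pivot = A[k][k]
            for i in range(k + 1, n):
                A[i][k] /= pivot
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        if k1 == n:
            break
        # 2. U12 = inv(L11) * A12 by forward substitution (L11 is unit lower).
        for i in range(k0 + 1, k1):
            for p in range(k0, i):
                lip = A[i][p]
                for j in range(k1, n):
                    A[i][j] -= lip * A[p][j]
        # 3. A22 -= L21 * U12: a single rank-w update of the trailing matrix.
        for i in range(k1, n):
            for p in range(k0, k1):
                lip = A[i][p]
                for j in range(k1, n):
                    A[i][j] -= lip * A[p][j]
        calls += 1
    return calls


def make_matrix(n):
    """Diagonally dominant test matrix, so LU without pivoting is safe."""
    return [[1.0 / (i + j + 1) + (n if i == j else 0.0) for j in range(n)]
            for i in range(n)]
```

For an n x n matrix, lu_rank1 issues n - 1 trailing updates, while lu_blocked
issues ceil(n / w) - 1, so a fixed cost per call is paid roughly w times less
often. Both variants produce the same factors up to rounding. The sketch does
not model communication or data distribution, which this paper's benchmarks
measure on the AP1000 and AP+.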