High Performance Dense Linear Systems Solution on a Beowulf Cluster

P.E. Strazdins. High Performance Dense Linear Systems Solution on a Beowulf Cluster, The 5th International Conference and Exhibition on High Performance Computing in the Asia-Pacific Region (HPC Asia 2001), Gold Coast, Sep. 2001.

Abstract

In this paper, we describe techniques that can improve the performance of dense linear system solution, based on LU, LL^T (Cholesky) and QR factorizations, on distributed memory multiprocessors, including cluster computers.

The most important of these are refinements of the algorithmic blocking technique that remove the bulk of the communication startup costs it introduces and make it superior to storage blocking in terms of communication volume costs. These refinements rely primarily on pipelined communication and on the choice of a small storage block size.
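The difference between the two blocking schemes can be illustrated with a minimal sketch of the block-cyclic column mapping. It is an illustration under stated assumptions rather than the paper's implementation: the function names (owner, panel_owners, trailing_counts) and the sizes used (256 columns, 4 process columns, panel width 64) are hypothetical, and only one matrix dimension is modelled. With storage blocking, the panel width equals the storage block size, so a panel lives in a single process column. With algorithmic blocking, the same panel width is kept but the storage block size is small, so the panel spans several process columns while the trailing columns stay evenly spread.

```python
from collections import Counter

def owner(j, r, P):
    """Process column owning global column j in a block-cyclic
    layout with storage block size r over P process columns."""
    return (j // r) % P

def panel_owners(j0, w, r, P):
    """Distinct process columns holding the panel j0 .. j0+w-1."""
    return sorted({owner(j, r, P) for j in range(j0, j0 + w)})

def trailing_counts(j0, n, r, P):
    """Number of trailing columns j0 .. n-1 held by each process column."""
    counts = Counter(owner(j, r, P) for j in range(j0, n))
    return [counts.get(p, 0) for p in range(P)]

n, P, w = 256, 4, 64  # hypothetical sizes, chosen only for illustration

# Storage blocking: storage block size equals the panel width.
print(panel_owners(0, w, r=64, P=P))     # [0]
print(trailing_counts(w, n, r=64, P=P))  # [0, 64, 64, 64]

# Algorithmic blocking: same panel width, small storage block size.
print(panel_owners(0, w, r=4, P=P))      # [0, 1, 2, 3]
print(trailing_counts(w, n, r=4, P=P))   # [48, 48, 48, 48]
```

In this toy case the small storage block size keeps every process column busy on the trailing columns, at the price of a panel that is spread over all process columns and therefore needs extra communication to factor.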

Two other techniques, optimizing the memory behavior of multiple row swaps and coalescing the vector-matrix multiplies in QR, also afford modest improvements in storage blocking and serial performance.

Performance results on a 24-node Beowulf cluster with 550 MHz dual SMP Pentium III nodes connected by a COTS switch with 10 Mb/s links show that algorithmic blocking generally improves performance by 15-30% or more for these computations over a large range of system sizes.

Keywords

dense linear algebra, block cyclic decomposition, storage blocking, algorithmic blocking, cluster computing.