Accelerated Methods for Performing the LDLT Decomposition

Peter E. Strazdins. Accelerated methods for performing the LDLT decomposition, Proceedings of The 1999 International Conference on Computational Techniques and Applications, Australian National University, Canberra, Sep 1999.
Appears in ANZIAM J. 42 (C), pp C1328--C1355, Adelaide, Dec 2000

paper (link to ANZIAM)
slides for CTAC'99 talk (120K, 10 pages)

Abstract

This talk will describe the design, implementation and performance of parallel direct dense symmetric-indefinite matrix factorization algorithms. Such a computation is required for the large complex systems arising from electro-magnetic field analysis, such as are generated from the AccuField application. The primary target architecture for the algorithms is the Fujitsu AP3000, a distributed memory machine based on the UltraSPARC processor.
These algorithms use the Bunch-Kaufman diagonal pivoting method. The starting point is numerically identical to LAPACK _sytrf() algorithm; however, using the same (highly optimized) BLAS, it out-performs zsytrf() by ~ 15% for large matrices on the UltraSPARC family of processors, due to the optimization of low-level components. The first variant reduces symmetric interchanges, particularly important for parallel implementation, by taking into account the growth attained by any preceding columns that did not require any interchanges. However, it achieves the same growth bound.
The second variant uses a lookahead technique to predict whether interchanges are required over the next block column; if so, the block column can be eliminated using modified Cholesky methods, which can yield both computational and communication advantages. In either case, the variants always achieve their target blocking factor, regardless of whether interchanges are required, permitting optimal performance to be achieved.
These algorithms yield best performance gains on `weakly indefinite' matrices (ie. those which have generally large diagonal elements), which often arise from the AccuField application. The reduction of interchanges that they both afford also helps preserve the structure of banded matrices. On UltraSPARC processors, the first variant generally achieves a 1-2% performance gain; the second is faster still for large matrices by 2% for complex double precision and 6% for double precision.

However, larger performance gains are observed on distributed memory machines, where symmetric interchanges are relatively more expensive. On a 16 node 300MHz AP3000, the first variant achieved a 10-15% improvement for small-moderate sized matrices, decreasing to 7% for large matrices. For N = 10000, it achieved a sustained speed of 5.6 GFLOPs and a parallel speedup of 12.8.

Keywords

dense linear algebra, block cyclic decomposition, symmetric indefinite systems, LDLT decomposition, MDMT decomposition

Accelerated Methods for Performing the LDLT Decomposition

Contents

Abstract

Keywords