Appears in ANZIAM J. 42 (C), pp C1328--C1355, Adelaide, Dec 2000

- paper (link to ANZIAM)
- slides for CTAC'99 talk (120K, 10 pages)

These algorithms use the Bunch-Kaufman diagonal pivoting method. The starting point is numerically identical to LAPACK _sytrf() algorithm; however, using the same (highly optimized) BLAS, it out-performs zsytrf() by ~ 15% for large matrices on the UltraSPARC family of processors, due to the optimization of low-level components. The first variant reduces symmetric interchanges, particularly important for parallel implementation, by taking into account the growth attained by any preceding columns that did not require any interchanges. However, it achieves the same growth bound.

The second variant uses a lookahead technique to predict whether interchanges are required over the next block column; if so, the block column can be eliminated using modified Cholesky methods, which can yield both computational and communication advantages. In either case, the variants always achieve their target blocking factor, regardless of whether interchanges are required, permitting optimal performance to be achieved.

These algorithms yield best performance gains on `weakly indefinite' matrices (ie. those which have generally large diagonal elements), which often arise from the AccuField application. The reduction of interchanges that they both afford also helps preserve the structure of banded matrices. On UltraSPARC processors, the first variant generally achieves a 1-2% performance gain; the second is faster still for large matrices by 2% for complex double precision and 6% for double precision.

*
However, larger performance gains are observed on distributed memory machines, where
symmetric interchanges are relatively more expensive. On a 16 node 300MHz AP3000, the first
variant achieved a 10-15% improvement for small-moderate sized matrices, decreasing to 7% for
large matrices. For N = 10000, it achieved a sustained speed of 5.6 GFLOPs and a parallel speedup
of 12.8.
*