The most important of these are refinements of the algorithmic blocking technique that eliminate most of the communication startup costs it introduces and make it superior to storage blocking in terms of communication volume costs. These refinements rely primarily on pipelined communication and on the choice of a small storage block size.
Two other techniques, optimizing memory behavior in multiple row swaps and coalescing the vector-matrix multiplies in QR, also afford modest improvements in storage-blocking and serial performance.
Performance results on a 24-node Beowulf cluster, with dual-processor 550 MHz Pentium III SMP nodes connected by a COTS switch with 10 Mb/s links, show that algorithmic blocking generally improves performance by 15--30% or more for these computations over a large range of system sizes.