Even with highly tuned (vendor-supplied) serial BLAS implementations, attention must be given to cell computation speed, since serial BLAS supplies neither a local matrix transpose routine (which is needed in many places) nor routines that adequately handle the triangular matrices arising in the parallel context. We describe the differing principles used on the UltraSPARC and VPP-300 nodes to optimize memory access patterns for the local matrix transpose operation and the large matrix multiply. The former uses partitioning methods that can yield a factor of 3-4 improvement over naive methods. The latter simultaneously optimizes usage of two levels of cache and the TLB, and outperforms the BLAS from the Sun Performance Library 1.2 by at least 15% on a 170 MHz UltraSPARC I. We also give comparisons between the other functions in our UltraSPARC-tuned BLAS and the Performance Library 1.2.
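The partitioning idea behind the tuned local transpose can be illustrated with a minimal sketch: the matrix is processed in small square tiles so that both the read and write streams stay within cache, avoiding the long-stride misses of the naive doubly nested loop. This is a generic cache-blocking sketch, not the actual DBLAS routine; the block size `bs` is illustrative (real codes tune it to the cache and TLB geometry of the node).

```python
def transpose_blocked(a, bs=4):
    """Blocked out-of-place transpose of a square list-of-lists matrix.

    Partitioning into bs x bs tiles keeps reads and writes localized
    in memory, which is the essence of the partitioning methods
    described in the text.  (Illustrative sketch only; bs would be
    tuned to the target cache in a real implementation.)
    """
    n = len(a)
    b = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            # transpose one bs x bs tile
            for i in range(ii, min(ii + bs, n)):
                for j in range(jj, min(jj + bs, n)):
                    b[j][i] = a[i][j]
    return b
```

The same loop-nest structure carries over to C or Fortran, where the cache effects it targets actually dominate; in those settings the factor of 3-4 cited above comes from eliminating strided misses on one of the two arrays.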
Unlike the AP1000 and AP+, the AP3000 and VPP-300 must simulate row and column broadcasts using point-to-point messages. Moreover, their ratios of communication latency to floating-point speed are much higher, and strided communication is not available. We describe and evaluate the methods used in the DBLAS spread and reduce communication primitives to minimize communication costs on these machines. These include pipelined broadcasts, and ring-shift methods for spread and reduce operations with multiple sources, both of which are significantly superior to binary-tree-based methods.
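The advantage of pipelining over a binary tree can be seen with a simple step-count model (a hypothetical cost model for illustration, not the actual DBLAS analysis): split the message into m blocks and send them along a chain of p processors, each block one hop behind the previous, versus forwarding the whole message at each level of a binary tree.

```python
from math import ceil, log2

def pipelined_broadcast_steps(p, m):
    """Steps for a pipelined broadcast along a chain of p processors
    with the message split into m blocks: the last block reaches the
    last processor after (p - 1) hops plus (m - 1) fill steps.
    (Simplified per-block step count; ignores latency/bandwidth terms.)
    """
    return (p - 1) + (m - 1)

def tree_broadcast_steps(p, m):
    """A binary-tree broadcast forwards all m blocks at each of
    ceil(log2(p)) levels, so it costs m steps per level."""
    return ceil(log2(p)) * m

# For long messages the pipeline wins clearly:
# p = 8, m = 16 gives 22 steps pipelined vs 48 for the tree.
```

Under this model the pipeline's cost grows as p + m while the tree's grows as m log p, which is why pipelined and ring-shift schemes pay off on machines where broadcasts must be built from point-to-point messages.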