Even with highly tuned (vendor-supplied) serial BLAS implementations, attention must be given to cell computation speed, since serial BLAS supplies neither a local matrix transpose routine (which is needed in many places) nor routines that adequately handle the triangular matrices arising in the parallel context. We describe the differing principles used on the UltraSPARC and VPP-300 nodes to optimize memory access patterns for the local matrix transpose operation and the large matrix multiply. The former uses partitioning methods that can yield a factor of 3-4 improvement over naive methods. The latter simultaneously optimizes usage of two levels of cache and the TLB, and outperforms the BLAS from the Sun Performance Library 1.2 by at least 15% on a 170 MHz UltraSPARC I. We also give comparisons between the other functions in our UltraSPARC-tuned BLAS and the Performance Library 1.2.
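The partitioning idea behind the tuned local transpose can be illustrated with a minimal sketch: the matrix is processed in small square tiles so that both the read and write streams stay within cache, avoiding the long-stride misses of the naive doubly nested loop. This is a generic cache-blocking sketch, not the actual DBLAS routine; the block size `bs` is illustrative (real codes tune it to the cache and TLB geometry of the node).

```python
def transpose_blocked(a, bs=4):
    """Blocked out-of-place transpose of a square list-of-lists matrix.

    Partitioning into bs x bs tiles keeps reads and writes localized
    in memory, which is the essence of the partitioning methods
    described in the text.  (Illustrative sketch only; bs would be
    tuned to the target cache in a real implementation.)
    """
    n = len(a)
    b = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            # transpose one bs x bs tile
            for i in range(ii, min(ii + bs, n)):
                for j in range(jj, min(jj + bs, n)):
                    b[j][i] = a[i][j]
    return b
```

The same loop-nest structure carries over to C or Fortran, where the cache effects it targets actually dominate; in those settings the factor of 3-4 cited above comes from eliminating strided misses on one of the two arrays.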
Unlike the AP1000 and AP+, the AP3000 and VPP-300 must simulate row and column broadcasts using point-to-point messages. Moreover, their ratios of communication latency to floating-point speed are much higher, and strided communication is not available. We describe and evaluate the methods used in the DBLAS spread and reduce communication primitives to minimize communication costs on these machines. These include pipelined broadcasts, and ring-shift methods for spread and reduce operations with multiple sources, both of which are significantly superior to binary-tree-based methods.
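The advantage of pipelining over a binary tree can be seen with a simple step-count model (a hypothetical cost model for illustration, not the actual DBLAS analysis): split the message into m blocks and send them along a chain of p processors, each block one hop behind the previous, versus forwarding the whole message at each level of a binary tree.

```python
from math import ceil, log2

def pipelined_broadcast_steps(p, m):
    """Steps for a pipelined broadcast along a chain of p processors
    with the message split into m blocks: the last block reaches the
    last processor after (p - 1) hops plus (m - 1) fill steps.
    (Simplified per-block step count; ignores latency/bandwidth terms.)
    """
    return (p - 1) + (m - 1)

def tree_broadcast_steps(p, m):
    """A binary-tree broadcast forwards all m blocks at each of
    ceil(log2(p)) levels, so it costs m steps per level."""
    return ceil(log2(p)) * m

# For long messages the pipeline wins clearly:
# p = 8, m = 16 gives 22 steps pipelined vs 48 for the tree.
```

Under this model the pipeline's cost grows as p + m while the tree's grows as m log p, which is why pipelined and ring-shift schemes pay off on machines where broadcasts must be built from point-to-point messages.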