In this paper, we present techniques and algorithms to improve the
performance of various communication patterns on message-passing
platforms where, for reasons of safety, user-level communications must
be buffered in (special) memory on both the send and the receive side.
These algorithms not only minimize message copying but also overlap the
copying to and from the special memory with the actual transfer,
enabling full bandwidth to be achieved. The patterns covered include
tree broadcasts and reductions, (ring-based) multiple broadcasts and
reductions, pipelined broadcasts, and buffered point-to-point sends. In
each case, the messages
may have a simple stride. All of these patterns are used in dense
linear algebra applications, although they are also used in many other
contexts.
These algorithms are implemented and their performance evaluated on the
Fujitsu AP3000, a message-passing multicomputer having many
characteristics of the cluster model. Some aspects, such as the
performance characteristics of the special memory, are specific to the
AP3000; however, the algorithms still apply to any platform using a similar
mode of user-level communications. Worthwhile performance increases
are obtained, especially for patterns involving moderate to large
numbers of processors.