This paper gives a performance analysis of the All-Gather,
All-Reduce and Reduce-Scatter collective communication operations
on a Beowulf cluster.
This cluster has a contention-free switch-based
network with multiple network interface cards per node, permitting
overlapping of message transmission under certain circumstances.
As well as considering traditional algorithms developed previously
for parallel computers with vendor-specific networks, we also
examine simpler algorithms made up of repeated sub-operations,
such as broadcasts. We find that for the kind of network on
the Beowulf cluster, a somewhat different performance modeling
of the algorithm is required, and that some simple simulation
tools had to be developed in order to fully understand some
of the algorithms' performance.
Our results indicate that the LAM MPI implementations for
these operations may be significantly improved, and the algorithms
with data exchange and potential contention perform well on
the cluster. Furthermore, they indicate that algorithms permitting
message overlap are slightly favoured, with a new and simple
algorithm which modestly out-performs the best traditional algorithms
in the case of Reduce-Scatter. With the exception that the degree
of overlapping proved difficult to estimate, our performance
models fitted closely with the results, and together with the
simulation tools, permit a detailed understanding of the cluster's
communication pattern performance.