Matrix Multiply on The UltraSparc: Report Dec 96
------------------------------------------------

1. Introduction

To improve the LINPACK Benchmark results on an AP3000, an efficient UltraSparc
matrix multiply routine (C += A*B, where C is N x N, A is N x K and B is K x N)
is required. Typically, 128 <= N <= 8192, and K can be chosen for some optimal
value, eg. K = 44, 64, 88 etc. The storage conventions of A and B can also be
chosen (eg. row-major or column-major). We write A^T (B^T) to denote that
column-major storage is used for A (B).

The objective is to sustain 250-270 MFLOPs for such a computation. The main
challenges are minimizing cache misses and TLB misses.

2. Comparison of upd44d() with UpdateRectTr1() (small N)

upd44d() is a Sparc V8a assembler routine written by Nagatsuka-san (Fujitsu
Laboratories), derived from Professor Richard Brent's version written for the
Sparc-1. It can be used to perform C = A*B and uses a 4x4 outer loop unrolling.
The inner (K) loop is unrolled two times, with a 9-10 cycle ldx-fmuld
separation and a 3 cycle fmuld-faddd separation.

UpdateRectTr1() is a C program written by the author (compiled with gcc; the
resulting V8 SPARC assembler must be post-processed by a stream editor (sed)
to produce efficient V8a code). It can be used to perform C = A^T*B and uses a
4x2 outer loop unrolling. The inner (K) loop is unrolled two times, with an
8 cycle ldx-fmuld separation and a 3 cycle fmuld-faddd separation.

The following table compares the speed (MFLOPs) of these routines when
integrated into the same test program (MM.cell), for N=32:

        K:              32    124    128
    upd44d():          289    321    320
    UpdateRectTr1():   276    320    313

Thus upd44d() is slightly faster, especially for K=128, where the slightly
longer ldx-fmuld separation or the 4x4 outer loop unrolling may give an extra
advantage when there is some conflict in the (direct-mapped) top-level 16KB
cache (this occurs most when K=128).

3. Timing Issues (for large computations)

The test program (MM.cell) repeatedly performs and times a matrix multiply; it
primarily reports the `best' speed (corresponding to the smallest time
interval) for both wall time (via gethrtime()) and process user time (via
times()), although speeds for all timings are also noted. Generally, wall time
is more accurate for intervals < 0.5s.

Small to medium sized computations (in terms of CPU time and memory) require
light process loads (eg. < 0.1 load from the `uptime' command) for accurate
measurements. Larger computations (eg. requiring > 1s of CPU time and > 10 MB
of data) require even lighter process loads (0.01 or less). Even then, small
amounts of interference (ie. from other processes, network interrupts etc.)
seem to affect measurements by 25% or more. This interference probably
involves pollution of the 2nd-level cache and/or the TLB and TLB cache. Thus,
such computations effectively require the system to have a single active user
(this is quite difficult to arrange on the DCS ANU Ultra workstations at
present).

Another factor that could have spurious effects on performance for large
computations is the mapping of virtual to physical memory, which is controlled
by Solaris. This is because the 2nd-level cache is direct-mapped on the real
(rather than the virtual) address.

4. Large matrix multiply computations

To minimize cache and TLB misses, it is important to have unit stride access
as much as possible. Thus upd44d() was modified by the author to perform
C += A*B^T; the number of changes was not large, and the resulting routine is
called upd44dtr2().
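For illustration only, the following C sketch (all names here are hypothetical;
the actual routine is hand-scheduled V8a assembler) shows the kind of 4x4
register-blocked computation that upd44dtr2() performs for C += A*B^T. In the
real routine the inner (K) loop is unrolled twice with the ldx-fmuld and
fmuld-faddd separations described in 2.; in the sketch that scheduling is left
to the compiler.

    /*
     * Sketch only: 4x4 register-blocked kernel for C += A*B^T.
     * A is m x k with row stride lda; B is k x n but stored column-major,
     * ie. as an n x k row-major array Bt with row stride ldb; C is m x n
     * with row stride ldc.  m and n are assumed to be multiples of 4.
     * Note that the inner (K) loop reads both A and Bt with unit stride.
     */
    static void upd44_abt_sketch(int m, int n, int k,
                                 const double *A, int lda,
                                 const double *Bt, int ldb,
                                 double *C, int ldc)
    {
        int i, j, p, ii, jj;

        for (i = 0; i < m; i += 4)
            for (j = 0; j < n; j += 4) {
                double c[4][4] = {{0.0}}; /* 16 accumulators, held in FP registers */

                for (p = 0; p < k; p++)
                    for (ii = 0; ii < 4; ii++)
                        for (jj = 0; jj < 4; jj++)
                            c[ii][jj] += A[(i+ii)*lda + p] * Bt[(j+jj)*ldb + p];

                for (ii = 0; ii < 4; ii++)
                    for (jj = 0; jj < 4; jj++)
                        C[(i+ii)*ldc + j+jj] += c[ii][jj];
            }
    }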
For small N, upd44dtr2() has the same performance as upd44d().

To perform C += A*B^T, the partitioning method is to divide B into block
columns of width N'. Thus a (contiguous) K x N' portion of B^T remains in
cache, while different rows of A and C are applied to it (a sketch of this
blocked driver is given at the end of this section).

Note that TLB misses have been measured to have ~ 1us overhead on an
UltraSparc under Solaris 5.5. This is equivalent to ~ 300 floating point
operations at peak speed. In the worst case, there could be a TLB miss for the
update of each segment (of width N') of each row of C in the above algorithm.
However, as 2*K*N' floating point operations are applied for each segment, the
effect of TLB misses should be small for appropriate values of N' and K (eg.
for K=88 and N'=256, each segment involves 2*88*256 = 45056 floating point
operations, so a single ~300-operation TLB miss penalty costs well under 1%).

The performance in MFLOPs for various N and K is given below (the wall timer
was generally used here). Here, the computation was performed over a column
block of width N' of B only (ie. B is N' x K and C is N x N', but the row
stride of C is still N).

    N \ K:     44     64     88    128    176

     512      257    267    279    255    283
    1024      257    258    278    256    279
    1536      251    259    274    255    279
    2048      251    252    273    249    275
    2560      252    245    268    250    281
    3072      247    254    275    248    276
    3584      245    244    260    258    278
    4096      246    248    267    225    243
    5120      249    254    278    256    262
    6144      242    240    268    225    244
   (8192*)   (239)  (246)  (271)  (246)  (260)

   (* due to memory limits, the row stride of C = N', not N, here)

Below are the results for the full matrix multiply computation. Again, these
were performed when the system load was as low as possible (but unfortunately
not zero). The speed in MFLOPs using the wall timer is given on the left; the
speed as measured by the process user time is on the right (a `-' is given
when the timer interval is < 0.5s and hence the timing will be inaccurate).
These are consistent with the previous table for N <= 2560. For N >= 3072,
performance drops suddenly, especially when the wall clock measurement is
used.

    N \     N'=256,K=44    N'=32,K=88    N'=256,K=88
    -------------------------------------------------
     512      247 / -        260 / -       274 / -
    1024      242 / -        253 / 250     270 / 271
    1536      244 / 244      238 / 238     267 / 273
    2048      245 / 246      239 / 237     267 / 269
    2560      240 / 240      236 / 235     266 / 266
    3072        8 / 223       44 / 234      35 / 229
    3584        6 / 222       16 / 230      13 / 220
    4096        5 / 209       16 / 230      10 / 221

The columns for N'=32 and for K=44 both use a smaller working set of pages,
and hence should be less vulnerable to TLB misses, if these are occurring to
any significant extent. The fact that these columns also show a degradation in
performance indicates that TLB misses are not the problem. For N >= 3072, it
seems the loss of performance is rather due to the fact that the total memory
requirements (>= 75MB) of the program were too large to fit into primary
memory, and page swapping resulted. This caused the sudden large decrease in
speed when measured by the wall time, and a smaller loss of speed (probably
due to cache and TLB pollution caused by the swapping) when measured by the
user time.

Note that the machine used (flash@anu.edu.au) has 128 MB of memory, of which
89 MB can be used for user processes. However, Solaris seems to be
conservative in giving all available memory to a compute-bound job, especially
if there are a number of other processes which are occasionally active. Using
the Unix `top' facility, it was confirmed that for N >= 3072, only 50-60 MB of
the data is resident in memory.

Several variants of the above algorithm were tried to reduce the effect of
page swapping. None were found to yield significant improvement.
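For reference, the following is a minimal sketch of the blocked driver
described at the start of this section (names and argument conventions are
assumed; the actual test program MM.cell calls upd44dtr2()). It uses the
hypothetical kernel upd44_abt_sketch() given earlier, and assumes N and N'
(here Nb) are multiples of 4.

    /*
     * Sketch only: blocked driver for C += A*B^T, with B stored
     * column-major, ie. as an N x K row-major array Bt.  The width-Nb
     * column block of B starting at column j is the contiguous Nb x K
     * panel Bt + j*K; this panel stays resident in the 2nd-level cache
     * while all N rows of A and C are streamed past it by the kernel.
     */
    void mm_blocked_sketch(int N, int K, int Nb,
                           const double *A,   /* N x K, row stride K   */
                           const double *Bt,  /* N x K, row stride K   */
                           double *C, int ldc /* N x N, row stride ldc */)
    {
        int j;

        for (j = 0; j < N; j += Nb) {
            int nb = (N - j < Nb) ? N - j : Nb;

            upd44_abt_sketch(N, nb, K, A, K, Bt + j*K, K, C + j, ldc);
        }
    }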
5. Conclusions
--------------

High speeds (250-270 MFLOPs) are consistently sustainable for large matrix
multiply computations on an UltraSparc I, provided that the system load is low
and the data can comfortably fit within real memory. Best performance can be
achieved for K=88, but K=44 is within 10% of this performance.

The algorithm is relatively simple, and takes unit stride considerations into
account to minimize cache and TLB misses. However, it requires the matrix B to
be in column-major orientation, and thus cannot simply be `plugged in' to the
existing LINPACK codes for the AP3000.

Peter Strazdins. 17/12/96