Matrix Multiply on The UltraSparc: Report Dec 96
------------------------------------------------

1. Introduction

To improve the LINPACK Benchmark results on an AP3000, an efficient UltraSparc
matrix multiply routine (C += A*B, where C is N x N, A is N x K and B is K x N)
is required. Typically, 128 <= N <= 8192, and K can be chosen for some optimal
value, eg. K = 44, 64, 88 etc. The storage conventions of A and B can also be
chosen (eg. row-major or column-major). We write A^T (B^T) to denote that
column-major storage is used for A (B).

The objective is to sustain 250-270 MFLOPs for such a computation. The main
challenges are minimizing cache misses and TLB misses.

2. Comparison of upd44d() with UpdateRectTr1() (small N)

upd44d() is a Sparc V8a assembler routine written by Nagatsuka-san (Fujitsu
Laboratories), derived from Professor Richard Brent's version written for the
Sparc-1. It can be used to perform C = A*B and uses a 4x4 outer loop unrolling.
The inner (K) loop is unrolled two times, with a 9-10 cycle ldx-fmuld
separation and a 3 cycle fmuld-faddd separation.

UpdateRectTr1() is a C program written by the author (compiled with gcc; the
resulting V8 SPARC assembler must be post-processed by a stream editor (sed)
to produce efficient V8a code). It can be used to perform C = A^T*B and uses a
4x2 outer loop unrolling. The inner (K) loop is unrolled two times, with an
8 cycle ldx-fmuld separation and a 3 cycle fmuld-faddd separation.

The following table compares the speed (MFLOPs) of these routines when
integrated into the same test program (MM.cell), for N=32:

        K:              32    124    128
    upd44d():          289    321    320
    UpdateRectTr1():   276    320    313

Thus upd44d() is slightly faster, especially for K=128, where the slightly
longer ldx-fmuld separation or the 4x4 outer loop unrolling may give an extra
advantage when there is some conflict in the (direct-mapped) top-level 16KB
cache (this occurs most when K=128).

3. Timing Issues (for large computations)

The test program (MM.cell) repeatedly performs and times a matrix multiply; it
primarily reports the `best' speed (corresponding to the smallest time
interval) for both wall time (via gethrtime()) and process user time (via
times()), although speeds for all timings are also noted. Generally, wall time
is more accurate for intervals < 0.5s.

Small to medium sized computations (in terms of CPU time and memory) require
light process loads (eg. < 0.1 load from the `uptime' command) for accurate
measurements. Larger computations (eg. requiring > 1s of CPU time and > 10 MB
of data) require even lighter process loads (0.01 or less). Even then, small
amounts of interference (ie. from other processes, network interrupts etc.)
seem to affect measurements by 25% or more. This interference probably
involves pollution of the 2nd-level cache and/or the TLB and TLB cache. Thus,
such computations effectively require the system to have a single active user
(this is quite difficult to arrange on the DCS ANU Ultra workstations at
present).

Another factor that could have spurious effects on performance for large
computations is the mapping of virtual to physical memory, which is controlled
by Solaris. This is because the 2nd-level cache is direct-mapped on the real
(rather than the virtual) address.

4. Large matrix multiply computations

To minimize cache and TLB misses, it is important to have unit stride access
as much as possible. Thus upd44d() was modified by the author to perform
C += A*B^T; the number of changes was not large, and the resulting routine is
called upd44dtr2().
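For illustration only, the following C sketch (all names here are hypothetical;
the actual routine is hand-scheduled V8a assembler) shows the kind of 4x4
register-blocked computation that upd44dtr2() performs for C += A*B^T. In the
real routine the inner (K) loop is unrolled twice with the ldx-fmuld and
fmuld-faddd separations described in 2.; in the sketch that scheduling is left
to the compiler.

    /*
     * Sketch only: 4x4 register-blocked kernel for C += A*B^T.
     * A is m x k with row stride lda; B is k x n but stored column-major,
     * ie. as an n x k row-major array Bt with row stride ldb; C is m x n
     * with row stride ldc.  m and n are assumed to be multiples of 4.
     * Note that the inner (K) loop reads both A and Bt with unit stride.
     */
    static void upd44_abt_sketch(int m, int n, int k,
                                 const double *A, int lda,
                                 const double *Bt, int ldb,
                                 double *C, int ldc)
    {
        int i, j, p, ii, jj;

        for (i = 0; i < m; i += 4)
            for (j = 0; j < n; j += 4) {
                double c[4][4] = {{0.0}}; /* 16 accumulators, held in FP registers */

                for (p = 0; p < k; p++)
                    for (ii = 0; ii < 4; ii++)
                        for (jj = 0; jj < 4; jj++)
                            c[ii][jj] += A[(i+ii)*lda + p] * Bt[(j+jj)*ldb + p];

                for (ii = 0; ii < 4; ii++)
                    for (jj = 0; jj < 4; jj++)
                        C[(i+ii)*ldc + j+jj] += c[ii][jj];
            }
    }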
For small N, upd44dtr2() has the same performance as upd44d().

To perform C += A*B^T, the partitioning method is to divide B into block
columns of width N'. Thus a (contiguous) K x N' portion of B^T remains in
cache, while different rows of A and C are applied to it (a sketch of this
blocked driver is given at the end of this section).

Note that TLB misses have been measured to have ~ 1us overhead on an
UltraSparc under Solaris 5.5. This is equivalent to ~ 300 floating point
operations at peak speed. In the worst case, there could be a TLB miss for the
update of each segment (of width N') of each row of C in the above algorithm.
However, as 2*K*N' floating point operations are applied for each segment, the
effect of TLB misses should be small for appropriate values of N' and K (eg.
for K=88 and N'=256, each segment involves 2*88*256 = 45056 floating point
operations, so a single ~300-operation TLB miss penalty costs well under 1%).

The performance in MFLOPs for various N and K is given below (the wall timer
was generally used here). Here, the computation was performed over a column
block of width N' of B only (ie. B is N' x K and C is N x N', but the row
stride of C is still N).

    N \ K:     44     64     88    128    176

     512      257    267    279    255    283
    1024      257    258    278    256    279
    1536      251    259    274    255    279
    2048      251    252    273    249    275
    2560      252    245    268    250    281
    3072      247    254    275    248    276
    3584      245    244    260    258    278
    4096      246    248    267    225    243
    5120      249    254    278    256    262
    6144      242    240    268    225    244
   (8192*)   (239)  (246)  (271)  (246)  (260)

   (* due to memory limits, the row stride of C = N', not N, here)

Below are the results for the full matrix multiply computation. Again, these
were performed when the system load was as low as possible (but unfortunately
not zero). The speed in MFLOPs using the wall timer is given on the left; the
speed as measured by the process user time is on the right (a `-' is given
when the timer interval is < 0.5s and hence the timing will be inaccurate).
These are consistent with the previous table for N <= 2560. For N >= 3072,
performance drops suddenly, especially when the wall clock measurement is
used.

    N \     N'=256,K=44    N'=32,K=88    N'=256,K=88
    -------------------------------------------------
     512      247 / -        260 / -       274 / -
    1024      242 / -        253 / 250     270 / 271
    1536      244 / 244      238 / 238     267 / 273
    2048      245 / 246      239 / 237     267 / 269
    2560      240 / 240      236 / 235     266 / 266
    3072        8 / 223       44 / 234      35 / 229
    3584        6 / 222       16 / 230      13 / 220
    4096        5 / 209       16 / 230      10 / 221

The columns for N'=32 and for K=44 both use a smaller working set of pages,
and hence should be less vulnerable to TLB misses, if these are occurring to
any significant extent. The fact that these columns also show a degradation in
performance indicates that TLB misses are not the problem. For N >= 3072, it
seems the loss of performance is rather due to the fact that the total memory
requirements (>= 75MB) of the program were too large to fit into primary
memory, and page swapping resulted. This caused the sudden large decrease in
speed when measured by the wall time, and a smaller loss of speed (probably
due to cache and TLB pollution caused by the swapping) when measured by the
user time.

Note that the machine used (flash@anu.edu.au) has 128 MB of memory, of which
89 MB can be used for user processes. However, Solaris seems to be
conservative in giving all available memory to a compute-bound job, especially
if there are a number of other processes which are occasionally active. Using
the Unix `top' facility, it was confirmed that for N >= 3072, only 50-60 MB of
the data is resident in memory.

Several variants of the above algorithm were tried to reduce the effect of
page swapping. None were found to yield significant improvement.
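For reference, the following is a minimal sketch of the blocked driver
described at the start of this section (names and argument conventions are
assumed; the actual test program MM.cell calls upd44dtr2()). It uses the
hypothetical kernel upd44_abt_sketch() given earlier, and assumes N and N'
(here Nb) are multiples of 4.

    /*
     * Sketch only: blocked driver for C += A*B^T, with B stored
     * column-major, ie. as an N x K row-major array Bt.  The width-Nb
     * column block of B starting at column j is the contiguous Nb x K
     * panel Bt + j*K; this panel stays resident in the 2nd-level cache
     * while all N rows of A and C are streamed past it by the kernel.
     */
    void mm_blocked_sketch(int N, int K, int Nb,
                           const double *A,   /* N x K, row stride K   */
                           const double *Bt,  /* N x K, row stride K   */
                           double *C, int ldc /* N x N, row stride ldc */)
    {
        int j;

        for (j = 0; j < N; j += Nb) {
            int nb = (N - j < Nb) ? N - j : Nb;

            upd44_abt_sketch(N, nb, K, A, K, Bt + j*K, K, C + j, ldc);
        }
    }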
5. Conclusions
--------------

High speeds (250-270 MFLOPs) are consistently sustainable for large matrix
multiply computations on an UltraSparc I, provided that the system load is low
and the data can comfortably fit within real memory. Best performance can be
achieved for K=88, but K=44 is within 10% of this performance.

The algorithm is relatively simple, and takes unit stride considerations into
account to minimize cache and TLB misses. However, it requires the matrix B to
be in column-major orientation, and thus cannot simply be `plugged in' to the
existing LINPACK codes for the AP3000.

Peter Strazdins. 17/12/96