Matrix Multiply on the UltraSparc I - large matrix performance
This report shows how to get
sustained 250-270 MFLOPS performance for large matrix multiply on a 170
MHz Ultra 1 (the Sun Performance Library 1.2 DGEMM() sustains 220-230
MFLOPS). This slide explains things
more succinctly.
UltraSparc BLAS (UBLAS)
Current implementation (8/9/98) has:
- double precision routines only; compact code size
- !!! no error checking is performed !!!
- `rectangular routines' only
- dgemm, dger, dgemv, dcopy, dswap, idamax, ddot, dnrm2, dasum
A `DGEMM()-based Level 3 BLAS' implementation, tuned for the UltraSPARC,
can be used to implement the triangular routines in terms of the UBLAS.
- best results can be obtained if the effective
cache size parameter is set to 256 KB
Performance results (on UltraSPARC I 170 MHz):
- for the Level 1 & 2 UBLAS
(note: Sun Perf Lib 1.2 often performs relatively better
than is indicated here for larger data sizes)
- 1000 x 1000 Matrix Factorization Performance
- using DBLAS-based codes compiled for a single
U170 processor using a blocking factor of 64.
- uses only the `rectangular BLAS routines'
- these codes have rather high software overheads,
and their memory access patterns may not be optimal
- results in MFLOPS (Sun Perf Lib 1.2 figures in parentheses)
- LU: 186 (176), LLT: 198 (131), QR: 207 (175)
To use these codes:
- first obtain a version of the BLAS (e.g. from netlib), link that
into your application and THOROUGHLY test your application.
- then re-link your application, putting the UBlas.a
archive *before* the other BLAS archive on the link line.
- re-run your application!
Disclaimer: Anything free comes with no guarantee!
- As this is the alpha release of the UBLAS, anyone using these
codes does so entirely at their own risk, and should *not*
assume that the code is free of bugs, major or minor.
- The author accepts *no* responsibility for any
loss or misfortune of others as a result of using these codes.
Having read the above Disclaimer, click here to obtain the
UBLAS 1.0 alpha archive file (12K gzipped archive; compiled under Solaris 5.6).