Matrix multiplication on workstations
99/02/01 (for a summary see: bench_bm_mm_all.txt)

Matrix multiplication on RISC workstations.

Where a BLAS benchmark is missing for a machine, the speedup from using
BLAS can be inferred from the general pattern.
Optimized compiler options were tried on the SGI Indy R5000/150 SC only.


TABLE 1. Matrix multiplication using program p25.f, compiled as
f77 -On ..., i.e., using compiler defaults. p25.f uses simple DO loops,
with the row-index loop (the vectorizable loop) innermost, for locality.
The MFLOPS entry is defined as 2 N**3 / (CPU time).
ND: dimension parameter; N: actual matrix size.
========================================================================
Machine                 Opt. level  ND    N     CPU (sec)   MFLOPS
------------------------------------------------------------------------
HP 710 32 MB            1           400   400   63          2
(ur)                    2, 3                    17          8
------------------------------------------------------------------------
SGI Indy R4600/133 SC   1           400   400   37          4
64 MB (atlas)           2, 3                    13          10
------------------------------------------------------------------------
HP 715/75 64 MB         1           400   400   39          3
(jupiter)               2, 3                    12          11
------------------------------------------------------------------------
SGI Indy R5000/150 SC   1           400   400   30          4
64 MB (uranus)          2, 3                    8           16
  f77 -O3 -r5000 -mips4 -n32 -LNO:ou=6          1.5         85
------------------------------------------------------------------------
HP B132 32 MB           1           400   400   16          8
(phobos) *1             2, 3                    4.5         28
------------------------------------------------------------------------
Pentium Pro P6/200      1           400   400   6           21
64 MB (f9pc00)          2, 3                    4.6         28
------------------------------------------------------------------------
SGI O2 R5000/180 SC     1           400   400
128 MB (calypso) *2     3                       7.2         18
========================================================================
*1 Compiled on ur (HP-UX 9, HP 710), run on HP-UX 10 (f77 not installed).
*2 Compiled on Indy R5000/150 (uranus).
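
The MFLOPS definition and the loop ordering described in the Table 1
caption can be illustrated with a short sketch. It is written in Python
purely for illustration (p25.f itself is Fortran and is not reproduced
here). The function names mflops and matmul_row_inner are hypothetical,
and the j-k-i loop nest is an assumption about what "row-index loop
innermost" means, using Fortran-style column-major storage.

```python
def mflops(n, cpu_seconds):
    """MFLOPS as defined in the Table 1 caption: 2 N**3 / (CPU time)."""
    return 2 * n**3 / cpu_seconds / 1.0e6


def matmul_row_inner(a, b, n):
    """Return C = A*B for n x n matrices held as flat column-major lists.

    Element (i, j) lives at index i + j*n, as in a Fortran array.
    Assumed loop order: column j, summation index k, row index i.
    """
    c = [0.0] * (n * n)
    for j in range(n):              # column of B and C
        for k in range(n):          # summation index
            bkj = b[k + j * n]      # constant over the innermost loop
            for i in range(n):      # row index: stride-1 in A and C
                c[i + j * n] += a[i + k * n] * bkj
    return c


if __name__ == "__main__":
    # Table 1, SGI Indy R5000/150 SC, opt. level 2, 3: N = 400, 8 s CPU.
    print(round(mflops(400, 8.0)))
```

For N = 400 and 8 s of CPU time, mflops(400, 8.0) gives 16, which agrees
with the tabulated entry. In Python the interpreter overhead dominates,
so the sketch shows only which index runs fastest, not the timing effect.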

Increase in CPU time when the summation loop is made the innermost loop:

  SGI Indy R4600/133 SC:  times 2
  SGI Indy R5000/150 SC:  times 2.6
  HP 715/75:              times 3.5


TABLE 2. As Table 1, but for program p25v.f, which calls DGEMM instead
of using DO loops. Default f77 options; precompiled system libraries are
used where available (dgemm.f is not compiled explicitly).
========================================================================
Machine                 Opt. level  ND    N     CPU (sec)   MFLOPS
------------------------------------------------------------------------
Pentium Pro P6/200      all         400   400   12.6        10 ?
64 MB (f9pc00) *1
------------------------------------------------------------------------
SGI Indy R4600/100 PC   all         400   400   19
32 MB (old atlas) *2
------------------------------------------------------------------------
SGI Indy R4600/133 SC   all         400   400   4.2         30
64 MB (atlas) *2                    401         4.1         30
------------------------------------------------------------------------
SGI Indy R5000/150 SC   all         400   400   2.64        48
64 MB (uranus) *2
------------------------------------------------------------------------
SGI O2 R5000/180 SC                 400   400   2.40        53
128 MB (calypso) *3
========================================================================
*1 f77 ... -lblas.
*2 Non-interleaved memory.
*3 Compiled on Indy R5000/150 (uranus).


TABLE 3. Integer matrix multiplication using program p2i.f.
========================================================================
Machine                 Opt. level  ND    N     CPU (sec)   MI*1    MC*2
------------------------------------------------------------------------
SGI Indy R5000/150 SC   1           400   400
64 MB (uranus)          3                       5.9         22      0.15
------------------------------------------------------------------------
SGI P. Chal. R8000/75   1           400   400
2000 MB (saturn)        3                       4.3         30      0.4
------------------------------------------------------------------------
HP B132 64 MB           1           400   400
(phobos) *1             4                       7.8         16      0.12
========================================================================
*1 Million integer operations per second.
*2 Integer operations per clock cycle.

R. Krivec