SUMMARY TABLES OF MATRIX MULTIPLICATION FLOATING POINT PERFORMANCE

Updated 05/12/24

Description

  Double precision (64-bit) matrix multiplication.

Justification

  This test should yield close to maximum performance, but only if the
  compiler/linker options are chosen correctly, so it also tests compiler
  usage.  The test correlates with the CFHHM benchmark, which spends most of
  its CPU time on matrix operations.  (An illustrative sketch of the two
  program forms compared below, hand-coded loops versus a DGEMM call, follows
  the Summary.)

Summary

  1. Only on a vector computer are hand-coded loops as effective as the DGEMM
     routine (see the R values).  The SGI Origin comes close.

  2. On R10000 SGI systems with newer compilers, the hand-coded loops are
     almost as fast as DGEMM.  On R8000 SGI systems, judicious use of compiler
     options brings the hand-coded loops to half the performance of DGEMM;
     with only -O3 the code is a further factor of three slower.  Precompiled
     DGEMM reaches the highest efficiency of all the tested computers (92%).

  3. On Alphas the hand-coded results vary wildly, and even the DGEMM results
     vary by a factor of 2 (see the R values).  However, the crossbar model
     GS160 (01/2001) jumps to 82% efficiency with DGEMM.  For the hand-coded
     test (14% efficiency) it is probably necessary to use better f77 options.

  4. Hand-coded Sun results are only preliminary.  The DGEMM results have been
     kindly improved by R. van der Pas of Sun Microsystems.

  5. Pentium 300 results depend strongly on whether the loops are in the main
     program or in a subroutine (see Table 2, note *2)!  Surely this is the
     compiler's fault!
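For orientation, the two program forms compared in Tables 2, 2-1 and 2-2 have
the following general shape.  This is only an illustrative sketch, not the
actual p25.f/p25v.f sources: the initialization, array names and exact loop
ordering are assumptions made here.  "Hand coded loops" (p25) stands for a
plain Fortran 77 triple loop written in vector style; "DGEMM" (p25v) stands
for a single call to the BLAS routine DGEMM as provided by the libraries
listed in square brackets in Table 1 (complib.sgimath, sunperf, dxml/cxml,
Atlas, ...).

C     Illustrative sketch only, not the actual p25.f/p25v.f sources.
C     ND is the declared dimension, N the actual matrix size.
      PARAMETER (ND = 800, N = 800)
      REAL*8 A(ND,ND), B(ND,ND), C(ND,ND)
C     Fill A and B with some data, clear C.
      DO 2 J = 1, N
         DO 1 I = 1, N
            A(I,J) = 1.0D0/DBLE(I+J)
            B(I,J) = DBLE(I-J)
            C(I,J) = 0.0D0
    1    CONTINUE
    2 CONTINUE
C
C     Hand-coded triple loop (p25 style).  The innermost index runs
C     over consecutive elements of a column (unit stride), which
C     vectorizing compilers handle well; the ordering actually used
C     in p25.f may differ.
      DO 30 J = 1, N
         DO 20 K = 1, N
            DO 10 I = 1, N
               C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE
C
C     Library version (p25v style): one call to the BLAS routine
C     DGEMM, computing C := 1.0*A*B + 0.0*C (needs a BLAS library).
      CALL DGEMM('N', 'N', N, N, N, 1.0D0, A, ND, B, ND, 0.0D0, C, ND)
      PRINT *, 'C(N,N) = ', C(N,N)
      END

In the benchmark the two variants are of course timed separately; the compile
and link lines of Table 1 are otherwise identical, the bracketed libraries
being needed only for the DGEMM version.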
Table 1. Machine descriptions for use in Table 2.

-------------------------------------------------------------------------------
Machine      Description                Compiler options
===============================================================================
SGI PCL      Power Challenge L,         /usr/bin/f77 -r8000 -mips4 -64 -O3
             R8000, 75 MHz,             -LNO:ou=:cs1=16k:cs2=4m -TENV:X=4
             4 MB L2, 2 GB              -WK,-so=3,-ro=3,-o=5 -WK,-p=1
                                        [-lcomplib.sgimath]
                                        (N = 800: n = 4, N = 1600: n = 8)
SGI 2000     Origin 2000,               f77 -r10000 -mips4 -64 -O3
             R10000, 250 MHz            -LNO:ou=
SGI 300a     Origin 2000,               f77 -O3 -mips4 -n32 [-lscs]
             R12000, 300 MHz, 8 MB L2   (D. Veberic, 99/12/09, Neuchatel)
Sun U60*1    Sun Ultra 60, 360 MHz      f77 -fast -O4 -inline=foo -fsimple=2
                                        -xchip=ultra2 -xarch=v8plusa
                                        -xcache=16/32/1:2048/64/1
Sun E65      Sun Enterprise 6500,       f77 -fast -fsimple=2
             400 MHz, 8 MB L2           -xtarget=ultra2 -xarch=v8plusa
                                        [-xlic_lib=sunperf]
Sun E45*2    Sun Enterprise 4500,       f77 -fast (N = 800)
             336 MHz, 8 GB              f77 -fast -O5 -fsimple=2 -pad
                                        -unroll=8 (N = 1600)
                                        [-xlic_lib=sunperf]
Sun E45a     Sun Enterprise 4500,       p25:  f77 -O5 -xarch=v9
             400 MHz, 28 GB             p25v: f77 -fast -xarch=v9
                                        -xlic_lib=sunperf
                                        Parallel versions: f77 -fast -xarch=v9
                                        -parallel [-xlic_lib=sunperf]
                                        (preliminary test)
Convex       C3860, 120 MFLOPS          fc -O2 [-lveclib]
A533s f77    ALPHA SX 533 MHz           f77 -O6 -funroll-all-loops
A533 g77     ALPHA 164LX/533            g77 -O6 -funroll-all-loops (Linux);
                                        g77 -O6 + libblas.a (Linux)
A533 f77     ALPHA 164LX/533            f77 -O5 -non_shared
                                        (DUX4.0 f77) + Exec;
                                        f77 -O5 -non_shared -ldxml
                                        (DUX4.0 f77 + DEC blas) + Exec
A600 f77     Alpha 164LX/600            f77 -O5 -non_shared -ldxml
                                        (DUX4.0 f77 + DEC blas)
GS160        Compaq GS160,              f77 -fast (?)                       *3
             731 MHz/4 MB L2 (?)
API          API UP2000 alpha,          fort -O5 -fast,
             667 MHz, linux             -O5 -fast -lcxml (DGEMM)            *3
P400         400 MHz Pentium            g77 + ppro blas
P300         300 MHz Pentium            f77 -O3 -funroll-loops
                                        -fno-rerun-loop-opt
P450         450 MHz Pentium            f77 -O3 -funroll-loops
                                        -fno-rerun-loop-opt
P450a        450 MHz Pentium            pgf77 -fast -O4 -Munroll -Mvect
                                        -tp p6                              *3
Indy5/150    SGI Indy R5000/150 SC      f77 -O3 -r5000 -mips4 -n32
                                        -LNO:ou=6:cs1=32k:cs2=512k -TENV:X=4
P6/200       Pentium Pro P6/200         ?
P1000M       1000 MHz Mobile Pentium    f77 -O3 -malign-double -funroll-loops
             (Dell 4100), 256 MB        -L/usr/lib/Linux_PII -lf77blas -latlas
             (N ~< 4000)
P2400        2400 MHz Pentium 4,        g77 -Wall -O3 -march=pentium4
             512 MB                     -funroll-all-loops -malign-double
                                        [-L/usr/local/Linux_P4SSE2/lib
                                        -llapack -lcblas -lf77blas -latlas]  *4
P2400bg      2400 MHz Pentium 4,        g77 -march=pentium4 -fstrength-reduce
             2048 MB                    -malign-double -funroll-loops -O3
                                        [-L/opt/atlas/lib
                                        -llapack -lcblas -lf77blas -latlas]  *5
P2400bi      2400 MHz Pentium 4,        ifc -O3 [ -pc32 | -pc64 ] -tpp7 -pad
             2048 MB                    -ip -i4 -unroll -xW
                                        [-L/opt/atlas/lib
                                        -llapack -lcblas -lf77blas -latlas]  *5
AO2200       Opteron 2200 MHz,          g77 -Wall -O3 -mcmodel=medium
             32 GB (4-way SMP)          -funroll-all-loops
                                        [-L/usr/local/lib64
                                        -llapack -lcblas -lf77blas -latlas]  *6
AO2200N      Opteron 2200 MHz,          f95 -f77 -abi=64 -O4
             32 GB (4-way SMP)          -L/usr/lib/gcc-lib/x86_64-redhat-linux/3.2.3/
                                        -lg2c [-L/usr/local/lib64
                                        -llapack -lcblas -lf77blas -latlas]  *7
AA2400       Athlon 2400 MHz            g77 -Wall -O3 -malign-double
             (3400+), 1 GB *8           -funroll-all-loops
                                        [-L/usr/local/Linux_P4SSE2/lib
                                        -llapack -lcblas -lf77blas -latlas]
AA2400D      Athlon 2400 MHz            g77 -Wall -O3 -mtune=k8
             (3400+), 1 GB *9           -funroll-all-loops
                                        [-llapack -lcblas -latlas]
-------------------------------------------------------------------------------
*1 Courtesy Sun Compiler Group (USA).
*2 DGEMM results probably wrong because the xarch option was not used with
   sunperf.
*3 D. Veberic.
*4 urania 03/09/08, R. Krivec.
*5 bender 03/09/08, R. Krivec.
*6 eos 05/04/25, R. Krivec, 2 processors idle before test.
*7 As *6 but one processor free.
*8 urubu (Fedora Core 2 32-bit).
*9 urubu (Gentoo 64-bit).

Table 2. Matrix multiplication: p25.f, p25v.f, REAL*8, column index innermost
(vector style); dimension specification ND, actual dimension N.  R is the
MFLOPS achieved divided by theoretical MFLOPS.  This test is intended for
evaluation of relatively simple compiler options.  This Table presents the
summary of all measurements.  For separate N, see Tables 2-1, 2-2.
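The MFLOPS and R columns in Tables 2, 2-1 and 2-2 follow from the usual
operation count for multiplying two N x N matrices, 2 N^3 floating point
operations (N^3 multiplications and N^3 additions):

    MFLOPS = 2 N^3 / (CPU time [s] * 10^6),    R = MFLOPS / theoretical peak.

As a check, SGI PCL at N = 1600: 2 * 1600^3 = 8.2e9 operations in 33.8 s give
about 242 MFLOPS, and with the 300 MFLOPS peak of the 75 MHz R8000 this
reproduces the tabulated 241 MFLOPS and R = 0.80 within rounding.  CPU times
are given in seconds and program sizes in MB throughout.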
-------------------------------------------------------------------------------
Machine        ND     N  Program   Hand coded loops (p25)      DGEMM (p25v)
                         size     CPU time MFLOPS    R     CPU time  MFLOPS    R
===============================================================================
SGI PCL       800   800  15 MB       6.4    160  0.53       4.1     250  0.83
             1600  1600  61           71    115  0.38      33.8     241  0.80
             4000  4000  384                                607     210  0.70
             6000  4000  512                                        250  0.83
             6000  6000  864                               1723     251  0.84
SGI 2000      800   800                3    340  0.68
             1600  1600             20.2    405  0.81
SGI 300a     1600   800                2    500  0.83       1.9     544  0.91
             1600  1600             16.5    496  0.83      14.9     550  0.92
Sun U60       800   800              2.6    395  0.55
             1600  1600             23.7    346  0.48
Sun E65       800   800              6.7    153  0.19       1.7     602  0.75
                                                        *1) 1.5     690  0.86
             1600  1600            203.3     40  0.05      13.7     598  0.75
                                                        *1) 11.5    710  0.89
Sun E45       800   800               13     79  0.12         4     260  0.38
             1600  1600              200     41  0.06        36     228  0.34
Sun E45a      800   800              9.3    110  0.14       1.5     660  0.82
 (1 proc)    1600  1600              202     41  0.05      11.8     693  0.86
             4000  4000                                   179.8     711  0.88
             8000  4000                                     182     703  0.88
             8000  8000                                    1464     699  0.87
 -parallel:   800   800
    1 proc                            10    102  0.12
    2 proc                             6    170  0.10
    4 proc                             3    340  0.10
    8 proc                            <2
   14 proc                            <2
             1600  1600
    1 proc                            83     99              12     682  0.85
    2 proc                            42    195               7    1200  0.75
    4 proc                            23    356               4
    8 proc                            11    744              <3
   14 proc                             6   1400  0.12
             4000  4000
    1 proc
    2 proc
    4 proc                                                   49    2600  0.81
    8 proc                                                   28    4600  0.71
   14 proc                                                   17    7500  0.66
Convex        800   800             10.7     96  0.8        9.8     104  0.87
             1600  1600                                    81.4     101  0.84
             4000  4000                                    1276     100  0.84
A533s f77     400   400              1.6     80  0.08
              800   800               19     54  0.05
             1600  1600              157     52  0.05
A533 g77     1600  1600              215     38  0.04
A533 f77                            24.4    335  0.31
A533 g77                                                   11.1     738  0.70
A533 f77                                                   22.2     369  0.35
A600 f77                                                   24.3     337  0.28
GS160                               40.2    204  0.14       6.8    1205  0.82
API                                 13.8    594  0.44       7.6    1078  0.81
P450a         800   800               11     90  0.20
             1600  1600              102     81  0.18
P450          800   800               15     68  0.15
             1600  1600              156     52  0.12
P400                                                         37     221  0.55
P300          800   800               29     35  0.12        48      21  0.07 *2)
             1600  1600              274     30  0.10       412      20  0.07 *2)
Indy5/150     400   400              1.5     85
                                    2.64     48
Indy5/150     800   800               13     79
P6/200                               4.6     28  0.14      12.6   10(?)
P1000M       4000   800              5.4                    1.4     726  0.73
                   1600                                    11.2     731  0.73
                   4000                                     174     735  0.74
P2400         800   800                2    496  0.20      0.32    3158  1.32
             1600  1600               16    497  0.20       2.6    3172  1.32
P2400bg       800   800              2.5    409  0.17      0.32    3200  1.33
P2400bi       800   800              2.3    455  0.19      0.34    3011  1.25
AO2200        800   800              3.1    331            0.28    3657  0.83
             1600  1600  61           23    356  0.08       2.1    3846  0.87
             4000  4000  384         372    344  0.07        34    3815  0.86
             8000  8000  1536       3009    340  0.07       268    3818  0.86
            16000 16000  6144                              2134    3837  0.87
AO2200N       800   800              2.8    368  0.08
             1600  1600  61           22    368  0.08
             4000  4000  384         346    370  0.08
AA2400        800   800              2.0    510  0.1       0.32    3181  0.66 *3)
             1600  1600  61           16    591  0.1        2.5    3285  0.68
AA2400D       800   800              1.9    538  0.1       0.26    3970  0.82 *4)
             1600  1600  61           15    541  0.1        2.0    4020  0.83
-------------------------------------------------------------------------------
*1) A version of p25 tuned for 8 MB L2.  R. van der Pas, 99/06/18.
*2) Program p30 and subroutine dmm0: 48 sec instead of 29 sec for N = 800, and
    412 sec instead of 274 sec for N = 1600, if the DO loops are moved into a
    subroutine (dmm0)!  A sketch of such a subroutine form follows these notes.
*3) Bad DGEMM efficiency, possibly due to the 32-bit Atlas installation
    (the 64-bit build gives an error; 32-bit FC2).  This is the Pentium 4
    SSE2 version of Atlas.  The Athlon in 32-bit mode is still slightly
    faster than the Pentium itself (see P2400).
*4) Gentoo 64-bit Linux with Gentoo's Atlas library (emerge
    sci-libs/lapack-atlas); not as efficient as under RHEL3 (AO2200*).
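Note *2) can be illustrated as follows.  This is only a guess at the general
structure of p30/dmm0 (the actual sources are not shown here): once the triple
loop is moved into a subroutine, the matrices become dummy arguments whose
leading dimension is known only at run time, and the f77/g77 optimizer may
treat them more conservatively than arrays local to the main program.

C     Hypothetical sketch of the subroutine form; the actual p30/dmm0
C     may differ.  A, B, C are dummy arguments with a run-time leading
C     dimension ND, which limits what the compiler may assume.
      SUBROUTINE DMM0(A, B, C, ND, N)
      INTEGER ND, N, I, J, K
      REAL*8 A(ND,ND), B(ND,ND), C(ND,ND)
      DO 30 J = 1, N
         DO 20 K = 1, N
            DO 10 I = 1, N
               C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE
      RETURN
      END

The main program then contains only CALL DMM0(A, B, C, ND, N) in place of the
inlined loops; on the P300 this change alone turned 29 s into 48 s at N = 800
and 274 s into 412 s at N = 1600.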
Table 2-1. Subset of Table 2: N = 800.  Ordered by DGEMM CPU times.

-------------------------------------------------------------------------------
Machine        ND     N  Program   Hand coded loops (p25)      DGEMM (p25v)
                         size     CPU time MFLOPS    R     CPU time  MFLOPS    R
===============================================================================
AO2200        800   800              3.1    331            0.28    3657  0.83
P2400         800   800                2    496  0.20      0.32    3158  1.32
P2400bg       800   800              2.5    409  0.17      0.32    3200  1.33
Sun E65       800   800  15 MB       6.7    153  0.19   *1) 1.5     690  0.86
SGI 300a     1600   800                2    500  0.83       1.9     544  0.91
Sun E45       800   800               13     79  0.12         4     260  0.38
SGI PCL       800   800              6.4    160  0.53       4.1     250  0.83
Convex        800   800             10.7     96  0.8        9.8     104  0.87
Indy5/150     800   800               13     79
A533s f77     800   800               19     54  0.05
P450a         800   800               11     90  0.20
-------------------------------------------------------------------------------
*1) A version of DGEMM tuned for 8 MB L2.  R. van der Pas, 99/06/18.

Table 2-2. Subset of Table 2: N = 1600.  Ordered by DGEMM CPU times.

-------------------------------------------------------------------------------
Machine        ND     N  Program   Hand coded loops (p25)      DGEMM (p25v)
                         size     CPU time MFLOPS    R     CPU time  MFLOPS    R
===============================================================================
AO2200       1600  1600  61           23    356  0.08       2.1    3846  0.87
P2400        1600  1600               16    497  0.20       2.6    3172  1.32
GS160                               40.2    204  0.14       6.8    1205  0.82
API                                 13.8    594  0.44       7.6    1078  0.81
A533 g77                 61 MB                             11.1     738  0.70
SGI 400 (extrap.)                                         (11.2)   (730)
Sun E65      1600  1600            203.3     40  0.05   *1) 11.5    710  0.89
SGI 300a     1600  1600             16.5    496  0.83      14.9     550  0.92
A533 f77                                                   22.2     369  0.35
A600 f77                                                   24.3     337  0.28
SGI PCL      1600  1600               71    115  0.38      33.8     241  0.80
Sun E45      1600  1600              200     41  0.06        36     228  0.34
P400                                                         37     221  0.55
Convex       1600  1600                                    81.4     101  0.84
A533 f77                            24.4    335  0.31
A533 g77     1600  1600              215     38  0.04
A533s f77                            157     52  0.05
P450a        1600  1600              102     81  0.18
-------------------------------------------------------------------------------
*1) A version of DGEMM tuned for 8 MB L2.  R. van der Pas, 99/06/18.

Table 3. As Table 2, but p2i.f, INTEGER*4.  This test uses only basic options.

-------------------------------------------------------------------------------
Machine        ND     N  Program   Hand coded loops (p2i)
                         size     CPU time   MI*1   MC*2
===============================================================================
SGI PCL       400   400              4.3     30    0.4    *3
Indy5/150                            5.9     22    0.15
SGI PCL       800   800               34                  *3
Indy5/150
Sun E45                               18                  *3
-------------------------------------------------------------------------------
*1 Million integer operations per second.
*2 Integer operations per clock cycle.
*3 Almost independent of compiler options.

---- R. Krivec.