Matrix multiplication on PCs, 99/11/22 (for a summary see: bench_bm_mm_all.txt)

These results are preliminary, because the set of f77 options tried may not be
exhaustive. The CPU times can be compared to the workstation benchmarks
performed WITHOUT precompiled BLAS libraries. The tables were generated with a
dummy timing routine in the programs; the "time" command was used for timing.

The CPU times differed when the multiplication loops were moved into a
subroutine (program P30 with subroutine DMM0, in contrast to program P25). The
difference was rather pronounced at N = 800 (29 versus 48 sec) and at N = 1600
(274 versus 412 sec). We surely have to install another compiler or some
such ...


Table I-1. Matrix multiplication (REAL*8) on Urartu (Linux, Pentium II 300,
64 MB). Program p25.f, ND = 800, N = 800: theoretical lower limit on CPU time:
3.4 sec.
-------------------------------------------------------------------------------
Program  f77 options                                              CPU time, sec
-------------------------------------------------------------------------------
p25      (none)                                                       97
         -O3                                                          47
         -O3 -funroll-loops                                           47
         -O3 -funroll-all-loops                                       47
         -O3 -fstrength-reduce                                        47
         -O3 -funroll-loops -fstrength-reduce                         47
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt     29
         -O3 -funroll-loops -fno-rerun-loop-opt                       29
         -O3 -funroll-all-loops -fno-rerun-loop-opt                   29
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
           -fexpensive-optimizations                                  29
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
           -fforce-mem                                                29
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
           -fforce-addr                                               29
         -O3 -funroll-loops -fno-rerun-loop-opt
           -fforce-mem -fforce-addr                                   29
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
           -fexpensive-optimizations -fno-move-all-movables           47
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
           -fexpensive-optimizations -fno-reduce-all-givs             35
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
           -fexpensive-optimizations -fcaller-saves                   29
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
           -fexpensive-optimizations -frerun-cse-after-loop           29
p30, loops in main
         -O3 -funroll-loops -fno-rerun-loop-opt                       34
p30, loops in subroutine dmm0
         -O3 -funroll-loops -fno-rerun-loop-opt                       48 *
-------------------------------------------------------------------------------
* Unexplained fall of performance if the loops are moved into a subroutine.


Table I-2. Matrix multiplication (REAL*8) on Urartu (Linux, Pentium II 300,
64 MB). Program p25.f, ND = 1600, N = 1600: theoretical lower limit on CPU
time: 27.3 sec.
-------------------------------------------------------------------------------
Program  f77 options                                              CPU time, sec
-------------------------------------------------------------------------------
p25      -O3 -funroll-loops -fno-rerun-loop-opt                       274
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt     272
p30, loops in main
         -O3 -funroll-loops -fno-rerun-loop-opt                       278
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
p30, loops in subroutine dmm0
         -O3 -funroll-loops -fno-rerun-loop-opt                       412 *
         -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
-------------------------------------------------------------------------------
* Unexplained fall of performance if the loops are moved into a subroutine.


Table II-1. As Table I-1, but on dexter (Pentium II 450 MHz). Theoretical lower
limit on CPU time: 2.3 sec.
-------------------------------------------------------------------------------
Program  Options                                                  CPU time, sec
-------------------------------------------------------------------------------
p25      f77 -O3                                                      27
         f77 -O5                                                      27
         f77 -O3 -funroll-loops                                       15 *
         f77 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt
           -fexpensive-optimizations -frerun-cse-after-loop           16
         pgf77 -fast -O4 -Munroll -Mvect -tp p6 *1                    11.4
-------------------------------------------------------------------------------
* 15 sec --> 68 MFLOPS, 15% efficiency.


Table II-2. Dexter (Pentium II 450 MHz), N = 1600. Theoretical lower limit on
CPU time: 18.2 sec.
-------------------------------------------------------------------------------
Program  Options                                                  CPU time, sec
-------------------------------------------------------------------------------
p25      f77 -O3
         f77 -O5
         f77 -O3 -funroll-loops                                       156 *
         pgf77 -fast -O4 -Munroll -Mvect -tp p6 *1                    101.8
-------------------------------------------------------------------------------
* 156 sec --> 52 MFLOPS, 12% efficiency.
*1 Courtesy of D. Veberic.
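
The theoretical lower limits and the MFLOPS and efficiency figures quoted with
the tables can be checked with a short calculation. The sketch below is in
Python rather than Fortran. It rests on two assumptions that are inferred from
the quoted numbers and not stated in the measurements themselves: an N x N
multiplication costs 2*N**3 floating-point operations (N**3 multiplications
plus N**3 additions), and the processor completes at most one such operation
per clock cycle. All function names are illustrative.

```python
def flop_count(n):
    """Operations in a naive N x N product: N**3 multiplies + N**3 adds (assumed)."""
    return 2 * n ** 3


def theoretical_seconds(n, clock_mhz, flops_per_cycle=1.0):
    """Lower limit on CPU time, assuming flops_per_cycle operations per clock."""
    return flop_count(n) / (clock_mhz * 1.0e6 * flops_per_cycle)


def mflops(n, cpu_seconds):
    """Achieved rate in millions of floating-point operations per second."""
    return flop_count(n) / cpu_seconds / 1.0e6


def efficiency(n, clock_mhz, cpu_seconds):
    """Fraction of the assumed one-operation-per-cycle peak that was reached."""
    return theoretical_seconds(n, clock_mhz) / cpu_seconds


if __name__ == "__main__":
    # (machine, clock in MHz, N, best f77 CPU time in sec), taken from the tables
    cases = [
        ("Urartu", 300, 800, 29.0),
        ("Urartu", 300, 1600, 272.0),
        ("dexter", 450, 800, 15.0),
        ("dexter", 450, 1600, 156.0),
    ]
    for machine, mhz, n, secs in cases:
        print(
            f"{machine:7s} N={n:5d}  limit={theoretical_seconds(n, mhz):6.1f} sec  "
            f"rate={mflops(n, secs):6.1f} MFLOPS  "
            f"efficiency={100.0 * efficiency(n, mhz, secs):5.1f}%"
        )
```

Under these assumptions the limits come out as 3.4 and 27.3 sec at 300 MHz and
as 2.3 and 18.2 sec at 450 MHz, in agreement with the table captions. The
dexter rates agree with the footnotes to within rounding.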