Matrix multiplication on PCs
                            99/11/22
                  (for a summary see: bench_bm_mm_all.txt)


These results are preliminary: the set of f77 options tried so far may
not be exhaustive.

The CPU times can be compared with the workstation benchmarks that were
run WITHOUT precompiled BLAS libraries.

The tables were generated with a dummy timing routine linked into the
program; the times were taken with the command "time <executable>".
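
A dummy timing routine of this kind might look like the sketch below. The
routine name CPUTIM and its interface are assumptions made for illustration;
the notes do not say what the real routine is called or how it is used.

```fortran
      PROGRAM TSTUB
C     Hypothetical benchmark skeleton that calls a timer CPUTIM.
      REAL*8 CPUTIM, T0, T1
      EXTERNAL CPUTIM
      T0 = CPUTIM()
C     ... the work to be timed would go here ...
      T1 = CPUTIM()
      WRITE (*,*) 'INTERNAL TIMER (DUMMY):', T1 - T0
      END

      REAL*8 FUNCTION CPUTIM()
C     Dummy timing routine: always returns zero, so the program
C     links and runs, and the time is taken externally instead.
      CPUTIM = 0.0D0
      RETURN
      END
```

With such a stub the internal timer always reports zero, and the figure to
read is the one printed by "time <executable>".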

The CPU times differed when the multiplication loops were moved into a
subroutine (program P30 with subroutine DMM0, in contrast to program P25).
The difference was pronounced at N = 800 (29 versus 48 sec) and at N = 1600
(274 versus 412 sec). We will probably have to install another compiler or
find some other remedy.
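
For orientation, the two variants being compared could look roughly like the
sketch below. The sources of P25, P30 and DMM0 are not reproduced in these
notes, so the loop order, the array names and the argument list of DMM0 are
assumptions, and the dimensions are kept tiny so that the sketch runs
instantly.

```fortran
      PROGRAM MMSKET
C     Hypothetical sketch: the same triple loop written inline
C     and inside a subroutine. All names are assumptions.
      INTEGER ND, N
      PARAMETER (ND = 4, N = 4)
      REAL*8 A(ND,ND), B(ND,ND), C1(ND,ND), C2(ND,ND), DIFF
      INTEGER I, J, K
C     Fill A and B with simple test values.
      DO 20 J = 1, N
         DO 10 I = 1, N
            A(I,J) = DBLE(I + J)
            B(I,J) = DBLE(I - J)
   10    CONTINUE
   20 CONTINUE
C     Variant 1: loops in the main program.
      DO 50 J = 1, N
         DO 40 I = 1, N
            C1(I,J) = 0.0D0
            DO 30 K = 1, N
               C1(I,J) = C1(I,J) + A(I,K)*B(K,J)
   30       CONTINUE
   40    CONTINUE
   50 CONTINUE
C     Variant 2: the same loops inside a subroutine.
      CALL DMM0(A, B, C2, ND, N)
C     Both variants must give the same product.
      DIFF = 0.0D0
      DO 70 J = 1, N
         DO 60 I = 1, N
            DIFF = MAX(DIFF, ABS(C1(I,J) - C2(I,J)))
   60    CONTINUE
   70 CONTINUE
      WRITE (*,*) 'MAX DIFFERENCE:', DIFF
      END

      SUBROUTINE DMM0(A, B, C, ND, N)
C     C = A*B for the leading N x N blocks of arrays whose
C     leading dimension is ND.
      INTEGER ND, N, I, J, K
      REAL*8 A(ND,*), B(ND,*), C(ND,*)
      DO 30 J = 1, N
         DO 20 I = 1, N
            C(I,J) = 0.0D0
            DO 10 K = 1, N
               C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE
      RETURN
      END
```

One difference visible in the sketch is that inside DMM0 the leading
dimension ND is known only at run time, whereas in the main program it is a
compile-time constant. Whether this accounts for the slowdown has not been
checked.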


Table I-1. Matrix multiplication (REAL*8) on Urartu (Linux, Pentium II
300 MHz, 64 MB). Program p25.f, ND = 800, N = 800: theoretical lower limit
on CPU time: 3.4 sec.
-------------------------------------------------------------------------------
Program  f77 options                                             CPU time (sec)
-------------------------------------------------------------------------------
p25      (none)                                                          97
         -O3                                                             47
         -O3 -funroll-loops                                              47
         -O3 -funroll-all-loops                                          47
         -O3                -fstrength-reduce                            47
         -O3 -funroll-loops -fstrength-reduce                            47
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt        29
	 -O3 -funroll-loops                   -fno-rerun-loop-opt        29
	 -O3 -funroll-all-loops               -fno-rerun-loop-opt        29
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt 
	      -fexpensive-optimizations                                  29

	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt 
	      -fforce-mem                                                29
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt 
	      -fforce-addr                                               29
	 -O3 -funroll-loops                   -fno-rerun-loop-opt
	      -fforce-mem -fforce-addr                                   29

	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt 
	      -fexpensive-optimizations -fno-move-all-movables           47
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt 
	      -fexpensive-optimizations -fno-reduce-all-givs             35
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt 
	      -fexpensive-optimizations -fcaller-saves                   29
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt 
	      -fexpensive-optimizations -frerun-cse-after-loop           29

p30, loops in main
	 -O3 -funroll-loops                   -fno-rerun-loop-opt        34

p30, loops in subroutine dmm0
	 -O3 -funroll-loops                   -fno-rerun-loop-opt        48 *
-------------------------------------------------------------------------------
* Unexplained drop in performance when the loops are moved into a subroutine.
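
The theoretical lower limits quoted in the table captions (3.4 and 27.3 sec
on the 300 MHz machine, 2.3 and 18.2 sec on the 450 MHz machine) are
consistent with counting 2*N**3 floating-point operations and assuming one
operation per clock cycle. The notes do not state this model, so it is an
assumption here; the function name TMIN is likewise made up for the sketch.

```fortran
      PROGRAM TLIMIT
C     Sketch only. Assumed model (not stated in the notes):
C     2*N**3 floating-point operations, one per clock cycle.
      REAL*8 TMIN
      EXTERNAL TMIN
      WRITE (*,'(A,F6.1)') ' N =  800, 300 MHz:', TMIN( 800, 300.0D0)
      WRITE (*,'(A,F6.1)') ' N = 1600, 300 MHz:', TMIN(1600, 300.0D0)
      WRITE (*,'(A,F6.1)') ' N =  800, 450 MHz:', TMIN( 800, 450.0D0)
      WRITE (*,'(A,F6.1)') ' N = 1600, 450 MHz:', TMIN(1600, 450.0D0)
      END

      REAL*8 FUNCTION TMIN(N, FMHZ)
C     Lower limit on CPU time in seconds for an N x N product
C     on a CPU clocked at FMHZ megahertz.
      INTEGER N
      REAL*8 FMHZ
      TMIN = 2.0D0*DBLE(N)**3/(FMHZ*1.0D6)
      RETURN
      END
```

Under this assumption the four values come out as 3.4, 27.3, 2.3 and 18.2
sec, matching the captions of Tables I-1, I-2, II-1 and II-2.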


Table I-2. Matrix multiplication (REAL*8) on Urartu (Linux, Pentium II
300 MHz, 64 MB). Program p25.f, ND = 1600, N = 1600: theoretical lower limit
on CPU time: 27.3 sec.
-------------------------------------------------------------------------------
Program  f77 options                                             CPU time (sec)
-------------------------------------------------------------------------------
p25
	 -O3 -funroll-loops                   -fno-rerun-loop-opt       274
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt       272

p30, loops in main
	 -O3 -funroll-loops                   -fno-rerun-loop-opt       278
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt       

p30, loops in subroutine dmm0
         -O3 -funroll-loops                   -fno-rerun-loop-opt       412 *
	 -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt      
-------------------------------------------------------------------------------
* Unexplained drop in performance when the loops are moved into a subroutine.


Table II-1. As Table I-1, but on dexter (Pentium II 450 MHz). Theoretical
lower limit on CPU time: 2.3 sec.
-------------------------------------------------------------------------------
Program  Options                                                 CPU time (sec)
-------------------------------------------------------------------------------
p25      f77 -O3                                                         27
             -O5                                                         27
             -O3 -funroll-loops                                          15 *
	     -O3 -funroll-loops -fstrength-reduce -fno-rerun-loop-opt 
	          -fexpensive-optimizations -frerun-cse-after-loop       16

         pgf77 -fast -O4 -Munroll -Mvect -tp p6    *1                    11.4
-------------------------------------------------------------------------------
*  15 sec corresponds to 68 MFLOPS, i.e. 15% efficiency relative to the
   theoretical limit.
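
The MFLOPS and efficiency figures in the footnote can be reproduced as in
the sketch below. It assumes 2*N**3 floating-point operations per
multiplication and takes efficiency to be the theoretical lower limit
divided by the measured CPU time; neither definition is spelled out in the
notes, and the variable names are made up for the example.

```fortran
      PROGRAM RATES
C     Sketch only. Assumes 2*N**3 floating-point operations and
C     defines efficiency as theoretical limit / measured time.
      INTEGER N
      REAL*8 T, TTHEO, RATE, EFF
C     Values from Table II-1: N = 800, 15 sec measured,
C     2.3 sec theoretical lower limit.
      N = 800
      T = 15.0D0
      TTHEO = 2.3D0
C     Rate in MFLOPS and efficiency in percent.
      RATE = 2.0D0*DBLE(N)**3/(T*1.0D6)
      EFF = 100.0D0*TTHEO/T
      WRITE (*,'(F6.1,A,F5.1,A)') RATE, ' MFLOPS, ', EFF, ' %'
      END
```

For the values above this gives about 68.3 MFLOPS and 15.3 %. Setting
N = 1600, T = 156 and TTHEO = 18.2 gives about 52.5 MFLOPS and 11.7 %, in
line with the footnote to Table II-2.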


Table II-2. As Table I-2, but on dexter (Pentium II 450 MHz). N = 1600.
Theoretical lower limit on CPU time: 18.2 sec.
-------------------------------------------------------------------------------
Program  Options                                                 CPU time (sec)
-------------------------------------------------------------------------------
p25      f77 -O3                                                             
             -O5                                                             
             -O3 -funroll-loops                                         156 *

         pgf77 -fast -O4 -Munroll -Mvect -tp p6    *1                   101.8
-------------------------------------------------------------------------------
*  156 sec corresponds to 52 MFLOPS, i.e. 12% efficiency relative to the
   theoretical limit.
*1  Courtesy of D. Veberic.