Matrix multiplication on workstations
                                99/02/01
                 (for a summary, see bench_bm_mm_all.txt)


Matrix multiplication benchmarks on RISC workstations. Where a machine
has no BLAS measurement, the speedup from using BLAS can be inferred
from the general pattern of the machines that do.

Compiler options beyond the defaults were tried only on the SGI Indy
R5000/150 SC.


TABLE 1.  Matrix multiplication using program p25.f, compiled as
f77 -On ..., i.e., with the compiler defaults apart from the
optimization level n.  p25.f uses simple DO loops with the row-index
loop (the vectorizable loop) innermost, for locality.  The MFLOPS entry
is defined as 2 N**3 / (CPU time in sec) / 10**6.  ND: dimension
parameter; N: actual matrix size.
========================================================================
Machine             Opt. level     ND       N        CPU (sec)   MFLOPS
------------------------------------------------------------------------
HP 710 32 MB           1          400     400           63          2
(ur)                   2, 3                             17          8
------------------------------------------------------------------------
SGI Indy R4600/133 SC  1          400     400           37          4
64 MB (atlas)          2, 3                             13         10
------------------------------------------------------------------------
HP 715/75 64 MB        1          400     400           39          3
  (jupiter)            2, 3                             12         11
------------------------------------------------------------------------
SGI Indy R5000/150 SC  1          400     400           30          4
64 MB (uranus)         2, 3                              8         16

f77 -O3 -r5000 -mips4 -n32 -LNO:ou=6                     1.5       85
------------------------------------------------------------------------
HP B132 32 MB          1          400     400           16          8
(phobos) *1            2, 3                              4.5       28
------------------------------------------------------------------------
Pentium Pro P6/200     1          400     400            6         21
64 MB (f9pc00)         2, 3                              4.6       28      
------------------------------------------------------------------------
SGI O2 R5000/180 SC    1          400     400           
128 MB (calypso) *2    3                                 7.2       18
========================================================================
*1 Compiled on ur (HP 710, HP-UX 9) and run on phobos under HP-UX 10,
   where f77 is not installed.
*2 Compiled on Indy R5000/150 (uranus).
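
The MFLOPS column of Table 1 can be reproduced from N and the CPU time.
The following is a minimal sketch in Python (not part of the benchmark
programs; the function name mflops is hypothetical) of the definition
2 N**3 / (CPU time). Because the tabulated CPU times are rounded, a
recomputed value may differ from the printed one in the last digit.

```python
def mflops(n, cpu_seconds):
    """Millions of floating-point operations per second for an N x N product.

    Uses the operation count from the Table 1 caption: 2 * N**3
    (one multiply and one add per innermost-loop iteration).
    """
    flops = 2 * n**3
    return flops / cpu_seconds / 1.0e6


if __name__ == "__main__":
    # HP 710 rows of Table 1 (N = 400): 63 sec at level 1, 17 sec at 2, 3.
    print(round(mflops(400, 63)))
    print(round(mflops(400, 17)))
```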


Factor by which the CPU time increases when the loops are reordered so
that the summation loop is innermost:
    SGI Indy R4600/133 SC:    times 2
    SGI Indy R5000/150 SC:    times 2.6
    HP 715/75:                times 3.5
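
The two loop orderings compared above can be sketched as follows. This
is an illustration in Python, not the source of p25.f: the function
names and the order of the two outer loops are assumptions, and Python
lists do not reproduce Fortran's column-major storage, so the sketch
shows only the loop structure, not the timing difference. In Fortran,
running the row index i innermost accesses a(i,k) and c(i,j) with
stride 1, whereas running the summation index k innermost accesses
a(i,k) with a stride equal to the leading dimension.

```python
def matmul_row_inner(a, b, n):
    """C = A*B with the row index i innermost (ordering described for p25.f)."""
    c = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            bkj = b[k][j]              # constant over the innermost loop
            for i in range(n):         # row index innermost
                c[i][j] += a[i][k] * bkj
    return c


def matmul_sum_inner(a, b, n):
    """C = A*B with the summation index k innermost (the slower variant)."""
    c = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            s = 0.0
            for k in range(n):         # summation index innermost
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c
```

Both functions compute the same product; only the memory access pattern
of the innermost loop differs.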


TABLE 2.  As Table 1, but for program p25v.f, which calls DGEMM instead
of using DO loops.  Default f77 options; precompiled system libraries
were used where available (dgemm.f was not compiled explicitly).
========================================================================
Machine             Opt. level     ND       N        CPU (sec)   MFLOPS
------------------------------------------------------------------------
Pentium Pro P6/200     all        400     400           12.6       10 ?
64 MB (f9pc00) *1
------------------------------------------------------------------------
SGI Indy R4600/100 PC  all        400     400                      19
32 MB (old atlas) *2
------------------------------------------------------------------------
SGI Indy R4600/133 SC  all        400     400            4.2       30
64 MB (atlas) *2                          401            4.1       30
------------------------------------------------------------------------
SGI Indy R5000/150 SC  all        400     400            2.64      48
64 MB (uranus) *2
------------------------------------------------------------------------
SGI O2 R5000/180 SC               400     400            2.40      53
128 MB (calypso) *3
========================================================================
*1 Linked as f77 ... -lblas.
*2 Non-interleaved memory.
*3 Compiled on Indy R5000/150 (uranus).
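
The BLAS speedup for the machines measured in both tables can be read
off by dividing the Table 1 CPU time (highest optimization level,
default options) by the Table 2 CPU time. The sketch below, in Python,
does this for the three SGI machines; the dictionary and function names
are hypothetical, and the numbers are copied from the tables. The
Pentium Pro row, marked "?" in Table 2, does not follow the same
pattern (12.6 sec with DGEMM against 4.6 sec with DO loops).

```python
# CPU times in seconds for N = 400, copied from Tables 1 and 2.
LOOPS_SEC = {"R4600/133": 13.0, "R5000/150": 8.0, "O2 R5000/180": 7.2}
DGEMM_SEC = {"R4600/133": 4.2, "R5000/150": 2.64, "O2 R5000/180": 2.40}


def blas_speedup(machine):
    """Ratio of DO-loop CPU time to DGEMM CPU time for one machine."""
    return LOOPS_SEC[machine] / DGEMM_SEC[machine]


if __name__ == "__main__":
    for name in LOOPS_SEC:
        print(name, round(blas_speedup(name), 1))
```

For these three machines the ratio is close to 3.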


TABLE 3.  Integer matrix multiplication using program p2i.f.
========================================================================
Machine             Opt. level     ND       N        CPU (sec) MI*1 MC*2
------------------------------------------------------------------------
SGI Indy R5000/150 SC  1          400     400            
64 MB (uranus)         3                                 5.9    22  0.15
------------------------------------------------------------------------
SGI P. Chal. R8000/75  1          400     400            
2000 MB (saturn)       3                                 4.3    30  0.4
------------------------------------------------------------------------
HP B132 64 MB          1          400     400           
(phobos) *1            4                                 7.8    16  0.12
========================================================================
*1  Million integer operations per second.
*2  Integer operations per clock cycle.
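
The MI and MC columns of Table 3 are consistent with the same operation
count as in Table 1, 2 N**3, divided by the CPU time and then by the
clock frequency. The sketch below, in Python, makes this explicit. The
operation count is an assumption inferred from the tabulated values,
the function name is hypothetical, and the clock frequencies are taken
from the machine names (132 MHz for the HP B132 is likewise an
assumption). The table's MC values agree with this to within rounding.

```python
def int_rates(n, cpu_seconds, clock_mhz):
    """Return (MI, MC) for an N x N integer matrix product.

    MI: millions of integer operations per second.
    MC: integer operations per clock cycle.
    Assumes 2 * N**3 operations, as in the MFLOPS definition of Table 1.
    """
    ops = 2 * n**3
    mi = ops / cpu_seconds / 1.0e6
    mc = mi / clock_mhz            # (10**6 ops/sec) / (10**6 cycles/sec)
    return mi, mc


if __name__ == "__main__":
    # SGI Indy R5000/150 (uranus), N = 400, 5.9 sec.
    mi, mc = int_rates(400, 5.9, 150)
    print(round(mi), mc)
```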


R. Krivec